What follows is a coded work-through of Thomas Lumley’s “Complex Surveys: A Guide to Analysis in R” (Lumley 2011) and the methods underlying design-based statistics more broadly. This write-up reflects my understanding of the material, with additions made to try to clarify ideas further. These include simulations, derivations, or references that I found helpful in working through Lumley’s material. Most of the datasets that I use throughout these notes are from Lumley’s website for the book.
It wasn’t very long into reading this book that I found that weighted sampling in R is, unfortunately, not well set up for complex designs. Further details are in this Stack Exchange post. It is not possible, for example, to simply use the popular tidyverse package dplyr’s function slice_sample() to draw a weighted sample from a population with appropriate inclusion probabilities. Instead a function from the sampling package has to be used. I use this function below in any setting where a non-uniform sample with specified inclusion probabilities is needed.
It’s worth further pointing out that the topic of how samples themselves are drawn is a complicated one in its own right and that the functions in the sampling package each have pros and cons according to the target estimand or question. Drawing samples with replicate weights – discussed in Chapter 2 – is a similarly complex question which I haven’t yet resolved to my satisfaction.
Key to analysis of complex surveys is the idea of studying the design from which the data are constructed, rather than the data itself. In a traditional survey setting the data are assumed to be fixed and the probabilities of sampling different entities are used to derive the desired estimate. These inclusion probabilities or their inverse, “sampling weights”, are used to re-balance the data so that they more accurately reflect the target population distribution. Different sampling techniques — clustering, 2-phase, etc. — are used to either decrease the variance of the resulting estimate, the cost associated with the design or both.
The Horvitz–Thompson Estimator (HTE) is the starting point for estimation under non-uniform random sampling. If we observe measure \(X_i\) on subject \(i\), sampled with inclusion probability \(\pi_i\) from a population of \(N\) total subjects, the HTE of the population total is formulated as follows:
\[ HTE(X) = \sum_{i \in S} \frac{1}{\pi_i}X_i, \tag{1.1} \]
where the sum runs over the set of sampled subjects \(S\). This is an unbiased estimator, as shown in the Chapter 1 Appendix.
The variance of this estimate can then be estimated by
\[ \hat{V}[HTE(X)] = \sum_{i,j \in S} \left ( \frac{X_i}{\pi_i} \frac{X_j}{\pi_j} - \frac{X_i X_j}{\pi_{ij}} \right ), \tag{1.2} \]
which follows from the Bernoulli covariance using indicator variables \(R_i=1\) if individual \(i\) is in the sample, \(R_i=0\) otherwise. A proof is provided in the Questions From Chapter 1.
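To convince myself of equation 1.1, here is a quick simulation of my own (not from the book) using Poisson sampling, in which each unit enters the sample independently with probability \(\pi_i\); the population values and inclusion probabilities are made up:

```r
# A sketch, not from the book: under Poisson sampling each unit enters the
# sample independently with probability pi_i, so repeated draws let us check
# that the Horvitz-Thompson total is unbiased.
set.seed(1)
N <- 1000
x <- rgamma(N, shape = 2, scale = 50)        # made-up population values
pi_i <- pmin(0.05 + 0.3 * x / max(x), 1)     # unequal inclusion probabilities

ht_total <- function() {
  in_sample <- rbinom(N, 1, pi_i) == 1
  sum(x[in_sample] / pi_i[in_sample])        # Horvitz-Thompson total
}

estimates <- replicate(5000, ht_total())
c(true_total = sum(x), mean_of_HT_estimates = mean(estimates))
```

Averaged over many draws, the HT totals land on the true total, which is exactly the unbiasedness claim above.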
Kish (1965) defined the notion of a design effect as the ratio of the variance of an estimate in a complex sample to the variance of the same estimate in a simple random sample (SRS). The motivation for this quantity is that it can guide researchers on sample size: if the sample size for a given level of precision is known for a simple random sample, the sample size for a complex design can be obtained by multiplying by the design effect.
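As a small sketch of that multiplication rule (the numbers are hypothetical):

```r
# scaling an SRS sample size by a design effect (hypothetical numbers):
# if an SRS of 50 achieves the target precision and prior studies report
# design effects of 2-3, the complex design needs 100-150 units
n_srs <- 50
deff  <- c(low = 2, high = 3)
n_srs * deff
```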
While larger sample sizes may be necessary to maintain the same level of variance as an SRS, the more complex design may still be justified because of its lower cost. See (Meng 2018) for an example of where design effects are used in a modern statistical setting to compare competing estimators.
From this point Lumley works through an introduction to the datasets used in the book and the idea that we’ll often be taking samples from datasets where we know the “true” population and computing estimates from there. This isn’t always the case in practice, and there may be some subtlety worth discussing about how to interpret results once we get into topics like regression, but for the most part his description makes sense.
One thing I found lacking in this introductory section is the motivation for why we might take non-uniform samples. It isn’t until Chapter 3 that Lumley discusses probability proportional to size (PPS) sampling, but this is very often the reason why a non-uniform sample is used.
If we have some measure that is right skewed in our population of interest and we’d like to estimate the mean, we could take an SRS, but the variance of that estimate would be higher than if we sampled proportional to the right-skewed measure itself. I’ll demonstrate with the following quick example: suppose we want to measure the income of a population. Incomes are often right skewed, but we can get a lower variance estimate if we take a weighted sample.
I generate a right skewed population and visualize the distribution.
population <- tibble(
id = 1:1E3,
income = 5E4 * rlnorm(1E3, meanlog = 0, sdlog = 0.5)
)
population %>%
ggplot(aes(x = income)) +
geom_histogram() +
geom_vline(aes(xintercept = mean(income)), linetype = 2, color = "red") +
ggtitle("Simulated Income Distribution") +
xlab("Income (USD)")
Here I’ll take a uniform and a weighted sample of size 50. The differences between the samples are subtle; they might not look all that different on visual inspection.
uniform_sample <- population %>%
slice_sample(n = 50) %>%
transmute(
income = income,
method = "uniform",
pi = 50 / 1E3
)
weighted_sample <- population %>%
mutate(
pi = sampling::inclusionprobabilities(floor(income), 50),
in_weighted_sample = sampling::UPbrewer(pi) == 1
) %>%
filter(in_weighted_sample) %>%
transmute(
income = income,
pi = pi,
method = "weighted"
)
rbind(
uniform_sample,
weighted_sample
) %>%
ggplot(aes(x = income, fill = method)) +
geom_histogram() +
ggtitle("Sample Comparisons") +
xlab("Income (USD)") +
theme(legend.title = element_blank(), legend.position = "top")
Finally I’ll estimate the population mean from both samples and include the design effect calculation in the weighted sample estimate.
uniform_sample %>%
as_survey_design() %>%
summarize(
mean_income = survey_mean(income)
)
## # A tibble: 1 × 2
## mean_income mean_income_se
## <dbl> <dbl>
## 1 52254. 3556.
weighted_sample %>%
as_survey_design(probs = pi) %>%
summarize(mean_income = survey_mean(income, deff = TRUE))
## # A tibble: 1 × 3
## mean_income mean_income_se mean_income_deff
## <dbl> <dbl> <dbl>
## 1 61049. 3615. 0.908
We see that in this particular draw the weighted estimate’s standard error is comparable to the uniform estimate’s, while the design effect for the weighted sample is less than 1, indicating that the weighted design needs a smaller sample than an SRS to achieve the same precision.
1.1-1.2 Don’t make sense to reproduce here.
1.3 Each visit to the front page of a newspaper’s website has (independently) a 1/1000 chance of resulting in a questionnaire on voting intentions in a forthcoming election. Assuming that everyone who is given the questionnaire responds, why are the results not a probability sample of voters, of readers of the newspaper, or of readers of the newspaper’s online version?
Lumley lists 4 properties needed for a sample to be considered a probability sample.
Every individual (unit of analysis) in the population must have a non-zero probability of ending up in the sample (\(\pi_i>0 \forall i\))
\(\pi_i\) must be known for every individual who does end up in the sample.
Every pair of individuals in the population must have a non-zero probability of both ending up in the sample (\(\pi_{i,j} > 0\ \forall i, j\))
The probability \(\pi_{i,j}\) must be known for every pair that does end up in the sample.
1 is not guaranteed when considering voters — there are voters who don’t read the paper and will have \(\pi_i = 0\) — or the broader heading of “readers” of the newspaper, since those who only read the physical paper will have \(\pi_i = 0\). For “readers of the newspaper’s online version” the sample would only be a probability sample if the time window were further specified, as there could be online readers who do not visit during the survey window and would thus have \(\pi_i=0\).
1.4 You are conducting a survey that will estimate the proportion of women who used anti-malarial insecticide-treated bed nets every night during their last pregnancy. With a simple random sample you would need to recruit 50 women in any sub-population where you wanted a standard error of less than 5 percentage points in the estimate. You are using a sampling design that has given design effects of 2-3 for proportions in previous studies in similar areas.
Larger: a design effect \(>1\) indicates that the variance is larger in the complex design at the same sample size; consequently the sample size will need to be increased to maintain the same level of precision.
100 - 150, derived from multiplying 50 by the design effects of 2 and 3.
1.5 Systematic sampling involves taking a list of the population and choosing, for example, every 100th entry in the list.
Items 2-4 from the list enumerated above are satisfied. The condition that fails is the first: not every item has a nonzero probability of being chosen.
This satisfies all items from the above list.
Again, this satisfies all items from the above list.
Again, all of them. The key is adding the known randomness and not excluding any items from selection.
This would be because the items are not ordered in any particular fashion prior to taking the “systematic sampling”. In this setting a systematic sample is equivalent to a simple random sample.
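A minimal sketch of systematic sampling with a random start (my own illustration; the function name is made up): the random start is what gives every item the same nonzero inclusion probability \(1/k\).

```r
# systematic sample: choose a random start in 1..k, then take every k-th item;
# with a uniformly random start each item has inclusion probability 1/k
systematic_sample <- function(x, k) {
  start <- sample.int(k, 1)
  x[seq(start, length(x), by = k)]
}

set.seed(7)
s <- systematic_sample(1:1000, k = 100)
length(s)   # a sample of 10 items, spaced 100 apart
```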
1.6 Why must all the sampling probabilities be non-zero to get a valid population estimate?
If any of the sampling probabilities are zero, that would introduce bias in shifting the estimate away from the portion of the population that would always be unobserved under repeated sampling.
1.7 Why must all the pairwise probabilities be non-zero to get a valid uncertainty estimate?
This is essentially the second-order version of the previous statement, but for the variance rather than the point estimate. The Horvitz–Thompson variance estimator weights each observed pair by \(1/\pi_{ij}\); if some pair can never be observed together (\(\pi_{ij} = 0\)), the terms for that pair are always missing and cannot be re-weighted, so the uncertainty estimate is biased.
1.8 A probability design assumes that people who are sampled will actually be included in the sample, rather than refusing. Look up the response rates for the most recent year of BRFSS and NHANES.
Lumley is highlighting the fact that even though we set up samples assuming that every sampled unit will be observed, that is rarely the case. Looking at just the most recent NHANES data I see response rates of ~78% for the unweighted, 6-year household survey.
1.9 In a telephone study using random digit dialing, telephone numbers are sampled with equal probability from a list. When a household is recruited, why is it necessary to ask how many telephones are in the household, and what should be done with this information in computing the weights.
It is necessary to ask how many telephones are in the household so the a priori sampling probability can be adjusted: every additional telephone line increases the chance that a given household is sampled. For example, take a simple population with two houses, where house one has 5 telephones and house two has 2, and we draw an \(n=1\) sample of telephone numbers. House one then has a \(\frac{5}{7}\) probability of being sampled. If we had (incorrectly) assumed equal probabilities, each house would get weight 2; knowing the phone counts, house one’s weight should instead be \(\frac{7}{5}\), the inverse of its actual sampling probability. In a real survey this correction would be made relative to all the other households’ numbers of telephones, or perhaps a population average of the number of telephones.
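The two-household example can be written out directly; the numbers are the hypothetical ones from the text above:

```r
# weights for the two-household telephone example: the selection probability
# is proportional to the number of phone lines, and the weight is its inverse
phones  <- c(house1 = 5, house2 = 2)
pi      <- phones / sum(phones)   # 5/7 and 2/7
weights <- 1 / pi                 # 7/5 = 1.4 and 7/2 = 3.5
weights
```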
1.10
Derive the Horvitz–Thompson variance estimator for the total as follows.
This follows in a straightforward fashion from the assumption that \(R_i\) is distributed according to the Bernoulli distribution and \(R_i \perp R_j\). This is an accurate model for sampling with replacement, or for sampling without replacement from a large population with a small sampling fraction, but less accurate for sampling without replacement when the sampling fraction is large.
\[ V[\hat{T}_{HT}] = \sum_{i=1}^{N}\sum_{j=1}^{N} \check{x}_i\check{x}_j(\pi_{ij} - \pi_i \pi_j) \] We have, \[ \hat{T}_{HT} := \sum_{i=1}^{N} \frac{X_i I(i \in S)}{\pi_i} \\ V[\hat{T}_{HT}] = V\left[\sum_{i=1}^{N}\frac{X_i I(i \in S)}{\pi_i} \right] \\ =\sum_{i=1}^{N}\sum_{j=1}^{N} Cov\left[\frac{X_i I(i \in S)}{\pi_i}, \frac{X_j I(j \in S)}{\pi_j}\right]\\ = \sum_{i=1}^{N}\sum_{j=1}^{N} \frac{X_i}{\pi_i}\frac{X_j}{\pi_j}Cov(I(i \in S),I(j \in S)) \\ = \sum_{i=1}^{N}\sum_{j=1}^{N} \frac{X_i}{\pi_i}\frac{X_j}{\pi_j} (\pi_{ij} - \pi_i \pi_j) \]
which is equivalent to the above, where \(\check{x_i} = \frac{X_i}{\pi_i}\).
To show the expression above is unbiased for \(\hat{V}[\hat{T}_{HT}]\) we must show that \(E\left [\hat{V}[\hat{T}_{HT}] \right] = V[\hat{T}_{HT}]\) \[ E \left [\hat{V}[\hat{T}_{HT} ] \right] = E \left[\sum_{i=1}^{N}\sum_{j=1}^{N} \frac{R_iR_j}{\pi_{ij}} \check{x}_i \check{x}_j(\pi_{ij} - \pi_i\pi_j) \right] \\ = \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{E[R_iR_j]}{\pi_{ij}} \check{x}_i \check{x}_j (\pi_{ij} - \pi_i \pi_j) \\ = \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{\pi_{ij}}{\pi_{ij}} \check{x}_i\check{x}_j(\pi_{ij} - \pi_i \pi_j) \\ = \sum_{i=1}^{N} \sum_{j=1}^{N} \check{x}_i\check{x}_j(\pi_{ij} - \pi_i \pi_j) \\ \blacksquare \]
\[ \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{R_iR_jx_ix_j}{\pi_{ij}\pi_i\pi_j}(\pi_{ij} - \pi_i \pi_j) \\ = \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{x_ix_j}{\pi_{ij}\pi_i\pi_j}(\pi_{ij} - \pi_i \pi_j) \\ = \sum_{i=1}^{n} \sum_{j=1}^{n} x_i x_j\left(\frac{1}{\pi_i \pi_j} - \frac{1}{\pi_{ij}}\right) \\ = \sum_{i=1}^{n} \sum_{j=1}^{n} \left( \frac{x_ix_j}{\pi_i \pi_j} - \frac{x_i x_j}{\pi_{ij}} \right) \]
The last line is the standard form of the Horvitz–Thompson variance estimator. Written with the terms in this order, the diagonal (\(i=j\)) contributions are \(x_i^2(1-\pi_i)/\pi_i^2 > 0\), as befits a variance; any apparent sign flip relative to expression 1.2 is just the two terms being transcribed in the opposite order, i.e. an overall factor of \(-1\).
1.11 Another popular way to write the Horvitz-Thompson variance estimator is
\[ \hat{V}[\hat{T}_{HT}] = \sum_{i=1}^{n} x_i^{2} \frac{1-\pi_i}{\pi_i^2} + \sum_{i\neq j}x_ix_j\frac{\pi_{ij} - \pi_i \pi_j}{\pi_i\pi_j\pi_{ij}} \]
Show that this is equivalent to equation 1.2
We need to show that the above is equivalent to expression 1.2, which in its standard ordering reads \[ \sum_{i,j} \frac{X_i}{\pi_i}\frac{X_j}{\pi_j} - \frac{X_iX_j}{\pi_{ij}} \]
First we fix \(i \neq j\) and find \[ \sum_{i\neq j} \frac{X_i}{\pi_i}\frac{X_j}{\pi_j} - \frac{X_iX_j}{\pi_{ij}} = \sum_{i\neq j} X_iX_j\left(\frac{1}{\pi_i \pi_j} - \frac{1}{\pi_{ij}}\right) \\ = \sum_{i \neq j} X_iX_j\left(\frac{\pi_{ij} - \pi_i\pi_j}{\pi_{ij}\pi_i\pi_j}\right), \]
which is exactly the second term of the expression above.
Now we take \(i=j\). Since \(I(i \in S)^2 = I(i \in S)\), the probability that \(i\) is sampled “together with itself” is simply \(\pi_{ii} = \pi_i\). The diagonal terms are then
\[ \sum_{i=j} \frac{X_i}{\pi_i}\frac{X_j}{\pi_j} - \frac{X_iX_j}{\pi_{ij}} = \sum_{i} X_i^2\left(\frac{1}{\pi_i^2} - \frac{1}{\pi_i}\right) = \sum_{i} X_i^2 \frac{1-\pi_i}{\pi_i^2}, \]
which is exactly the first term of the expression above, completing the equivalence.
The HTE is an unbiased estimator of the population total - I reproduce the expression from above, but now make explicit the indicator variables that express which observations are included in our sample, \(S\).
\[ HTE := \sum_{i=1}^{N} \frac{X_i I(i \in S)}{\pi_i} \\ E[HTE] = E\left [\sum_{i=1}^{N} \frac{X_i I(i \in S)}{\pi_i} \right ] \\ = \sum_{i=1}^{N} E \left [\frac{X_i I(i \in S)}{\pi_i} \right ] \\ = \sum_{i=1}^{N} \frac{X_i E[I(i \in S)]}{\pi_i} \\ = \sum_{i=1}^{N} \frac{X_i \pi_i}{\pi_i} = \sum_{i=1}^{N} X_i \]
When dealing with a sample of size \(n\) from a population of size \(N\) the HTE of the total value of \(X_i\) in the population can be written as
\[ \begin{equation} HTE(X) = \hat{T_X} = \sum_{i=1}^{n} \frac{X_i}{\pi_i}. \end{equation} \]
For a simple random sample, the variance can be more explicitly written as
\[ \begin{equation} V[\hat{T_X}] = \frac{N-n}{N} \times N^{2} \times \frac{V[X]}{n}, \end{equation} \]
where \(\frac{N-n}{N}\) is the finite population correction factor. This factor is derived from the hypergeometric distribution and captures the reduction in uncertainty that follows from sampling a large portion of the population. Consequently, if the sample is taken with replacement — the same individual or unit can be sampled twice — this term is no longer relevant. Sampling with replacement is rarely used in practice, however; the phrase is sometimes used merely to signal that the finite population correction is not being applied.
The second term, \(N^2\), rescales the estimate from the mean to the total, while the final term is simply the scaled variance of \(X\).
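As a quick empirical check of my own (not from the book), we can compare the simulated variance of the estimated total under SRS without replacement against the formula above:

```r
# empirical check of V[T_hat] = (N - n)/N * N^2 * var(X)/n for an SRS drawn
# without replacement; R's var() uses the N-1 denominator, matching S^2
set.seed(2)
N <- 500; n <- 200
x <- rnorm(N, mean = 100, sd = 15)   # made-up population

totals <- replicate(10000, N * mean(sample(x, n)))  # estimated total per draw
c(empirical = var(totals),
  formula   = (N - n) / N * N^2 * var(x) / n)
```

The two numbers agree to within simulation error; dropping the \(\frac{N-n}{N}\) factor here would overstate the variance by a factor of \(N/(N-n)\).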
A point worth deliberating on, which Lumley notes as well, is that while the above equations suggest a larger sample size is always better, that is not always the case in reality. Non-response bias or the cost of surveys can dramatically diminish the quality of the dataset, even if its size is large. I say this is worth deliberating on because it is a matter of increasing importance in the world of “Big Data” - where it can be easy to delude oneself with confidence in one’s estimates because the sample is large, even when the sample is not well designed. See (Meng 2018) for a larger discussion of this topic.
It follows from the above that the HTE for the population size is \(\hat{N} = \sum_{i=1}^{n} \frac{1}{\pi_i}\). This holds trivially in the SRS case here, where \(\pi_i = \frac{n}{N}\) gives \(\hat{N} = N\) exactly, but it also holds when \(\pi_i\) varies across units.
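And a similar sketch (again my own) checking \(\hat{N}\) when the \(\pi_i\) differ across units, using Poisson sampling:

```r
# HT estimate of the population size: the sum of inverse inclusion
# probabilities over the sampled units, averaged over repeated draws
set.seed(5)
N <- 2000
pi_i <- runif(N, 0.05, 0.5)                  # unequal, known probabilities

N_hat <- replicate(3000, {
  in_sample <- rbinom(N, 1, pi_i) == 1       # Poisson sampling
  sum(1 / pi_i[in_sample])
})
c(true_N = N, mean_N_hat = mean(N_hat))
```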
The sampling distribution for the estimates — typically sample means and sums — across “repeated surveys” is Normal by the Central Limit Theorem, so the typical \(\bar{x} \pm 1.96 \sqrt{\frac{\sigma^2_X}{n}}\), expression is used to calculate a 95% confidence interval. Lumley offers the following example from the California Academic Performance Index (API) dataset to illustrate this idea.
data(api)
mn_enroll <- mean(apipop$enroll, na.rm = TRUE)
p1 <- apipop %>%
ggplot(aes(x = enroll)) +
geom_histogram() +
xlab("Student Enrollment") +
geom_vline(xintercept = mn_enroll, linetype = 2, color = "red") +
ggtitle("Distribution of School Enrollment")
p2 <- replicate(n = 1000, {
apipop %>%
sample_n(200) %>%
pull(enroll) %>%
mean(., na.rm = TRUE)
})
mn_sample_mn <- mean(p2)
p2 <- tibble(sample_ix = 1:1000, sample_mean = p2) %>%
ggplot(aes(x = sample_mean)) +
geom_histogram() +
xlab("Student Enrollment Averages") +
geom_vline(
xintercept = mn_sample_mn,
linetype = 2, color = "red"
) +
ggtitle("Distribution of Sample Means")
p1 + p2
What follows is a work-up of basic survey estimates using the
California API dataset composed of student standardized test scores.
I’ll work through the code once using the survey package
and a second time using the srvyr package, which has a tidyverse friendly API.
Much of the computational work in this book begins with creating a design object, from which weights and other information can then be drawn on for any number/type of estimates.
For example, we create a basic design object below, where we look at a classic simple random sample (SRS) of the schools in the API dataset. Let’s take a look at the dataset first.
dplyr::as_tibble(apisrs)
## # A tibble: 200 × 39
## cds stype name sname snum dname dnum cname cnum flag pcttest api00
## <chr> <fct> <chr> <chr> <dbl> <chr> <int> <chr> <int> <int> <int> <int>
## 1 15739081… H "McF… McFa… 1039 McFa… 432 Kern 14 NA 98 462
## 2 19642126… E "Sto… Stow… 1124 ABC … 1 Los … 18 NA 100 878
## 3 30664493… H "Bre… Brea… 2868 Brea… 79 Oran… 29 NA 98 734
## 4 19644516… E "Ala… Alam… 1273 Down… 187 Los … 18 NA 99 772
## 5 40688096… E "Sun… Sunn… 4926 San … 640 San … 39 NA 99 739
## 6 19734456… E "Los… Los … 2463 Haci… 284 Los … 18 NA 93 835
## 7 19647336… M "Nor… Nort… 2031 Los … 401 Los … 18 NA 98 456
## 8 19647336… E "Gla… Glas… 1736 Los … 401 Los … 18 NA 99 506
## 9 19648166… E "Max… Maxs… 2142 Moun… 470 Los … 18 NA 100 543
## 10 38684786… E "Tre… Trea… 4754 San … 632 San … 37 NA 90 649
## # ℹ 190 more rows
## # ℹ 27 more variables: api99 <int>, target <int>, growth <int>, sch.wide <fct>,
## # comp.imp <fct>, both <fct>, awards <fct>, meals <int>, ell <int>,
## # yr.rnd <fct>, mobility <int>, acs.k3 <int>, acs.46 <int>, acs.core <int>,
## # pct.resp <int>, not.hsg <int>, hsg <int>, some.col <int>, col.grad <int>,
## # grad.sch <int>, avg.ed <dbl>, full <int>, emer <int>, enroll <int>,
## # api.stu <int>, pw <dbl>, fpc <dbl>
In the code below fpc stands for the aforementioned
finite population correction factor and id=~1 designates
the unit of analysis as each individual row in the dataset.
srs_design <- svydesign(id = ~1, fpc = ~fpc, data = apisrs)
srs_design
## Independent Sampling design
## svydesign(id = ~1, fpc = ~fpc, data = apisrs)
In order to calculate the mean enrollment based on this sample, the appropriately named svymean function can be used.
svymean(~enroll, srs_design)
## mean SE
## enroll 584.61 27.368
The point estimate is the same as the typical computation - which makes sense, this is a SRS! The standard error differs slightly because the design object includes the finite population correction.
c(
"Mean" = mean(apisrs$enroll),
"SE" = sqrt(var(apisrs$enroll) / nrow(apisrs))
)
## Mean SE
## 584.61000 27.82121
Instead of specifying the finite population correction factor, the sampling weights could be used - since this is a SRS, all the weights should be the same.
as_tibble(apisrs) %>% distinct(pw)
## # A tibble: 1 × 1
## pw
## <dbl>
## 1 31.0
nofpc <- svydesign(id = ~1, weights = ~pw, data = apisrs)
nofpc
## Independent Sampling design (with replacement)
## svydesign(id = ~1, weights = ~pw, data = apisrs)
Use svytotal to calculate the estimate of the total
across all schools, note that the standard error will be different
between the two designs because of the lack of fpc.
svytotal(~enroll, nofpc)
## total SE
## enroll 3621074 172325
svytotal(~enroll, srs_design)
## total SE
## enroll 3621074 169520
Totals across groups can be calculated using the ~
notation with a categorical variable.
svytotal(~stype, srs_design)
## total SE
## stypeE 4397.74 196.00
## stypeH 774.25 142.85
## stypeM 1022.01 160.33
svycontrast can be used to calculate the difference or
addition of two different estimates - below we estimate the difference
in the 2000 and 1999 scores based on the SRS design.
svycontrast(svymean(~ api00 + api99, srs_design), quote(api00 - api99))
## nlcon SE
## contrast 31.9 2.0905
srvyr package
dstrata <- apisrs %>%
as_survey_design(fpc = fpc)
dstrata %>%
mutate(api_diff = api00 - api99) %>%
summarise(
enroll_total = survey_total(enroll, vartype = "ci"),
api_diff = survey_mean(api_diff, vartype = "ci")
) %>%
gt()
| enroll_total | enroll_total_low | enroll_total_upp | api_diff | api_diff_low | api_diff_upp |
|---|---|---|---|---|---|
| 3621074 | 3286789 | 3955360 | 31.9 | 27.77764 | 36.02236 |
Simple random samples are not often used in complex surveys because there is a justified concern that some strata (e.g. racial/ethnic group, age group, etc.) may be underrepresented if a simple random sample were used. Similarly, complex designs can give the same precision at a lower cost. Consequently, a sample may be constructed so that some units are guaranteed to be included within each stratum - improving the resulting variance. When each stratum is sampled by simple random sampling, the HTE and variance of the population total are simply the sums of the strata-specific estimates; \(\hat{T}_{HT} = \sum_{s=1}^{S} \hat{T}^{s}_X\), where there are \(S\) strata within the population.
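Before turning to the apistrat example, here is a small simulation of my own, with made-up strata sized loosely like the school types, showing why fixing the sample size within strata helps when the strata explain much of the variability:

```r
# compare an SRS mean to a proportionally allocated stratified mean on a
# synthetic population whose strata differ a lot in their means
set.seed(6)
n_h  <- c(E = 600, M = 250, H = 150)     # stratum population sizes (made up)
mu_h <- c(E = 400, M = 700, H = 1100)    # stratum means (made up)
m_h  <- c(E = 60,  M = 25,  H = 15)      # proportional allocation of n = 100
pop  <- data.frame(
  stratum = rep(names(n_h), times = n_h),
  size    = rnorm(sum(n_h), mean = rep(mu_h, times = n_h), sd = 100)
)
W_h <- n_h / sum(n_h)                    # stratum weights N_h / N

srs_means <- replicate(3000, mean(sample(pop$size, 100)))
strat_means <- replicate(3000, {
  means_h <- sapply(names(n_h), function(h)
    mean(sample(pop$size[pop$stratum == h], m_h[h])))
  sum(W_h * means_h)                     # stratified estimate of the mean
})
c(var_srs = var(srs_means), var_strat = var(strat_means))
```

Because the between-stratum differences dominate the within-stratum noise here, the stratified estimator’s variance comes out far below the SRS variance.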
For example, in the apistrat data set a stratified
random sample of 200 schools is recorded such that schools are sampled
randomly within school type ( elementary, middle school or high
school).
In the code below we can designate the strata using the categorical
variable stype, which denotes each of the school type as
strata.
strat_design <- svydesign(
id = ~1,
strata = ~stype,
fpc = ~fpc,
data = apistrat
)
strat_design
## Stratified Independent Sampling design
## svydesign(id = ~1, strata = ~stype, fpc = ~fpc, data = apistrat)
svytotal(~enroll, strat_design)
## total SE
## enroll 3687178 114642
svymean(~enroll, strat_design)
## mean SE
## enroll 595.28 18.509
svytotal(~stype, strat_design)
## total SE
## stypeE 4421 0
## stypeH 755 0
## stypeM 1018 0
Note that the standard errors are 0 for the within-strata estimates: if we have strata information on each member of the population, then we know the strata counts without any uncertainty.
Several points worth noting about stratified samples before moving on.
Stratified samples get their power from “conditioning” on the strata information that explain some of the variability in the measure.
Whereas an SRS might by chance leave out elementary or middle schools entirely and return a higher enrollment estimate because high schools are over-represented in the sample, keeping a fixed number of samples within each stratum removes this problem.
Stratified analysis may also refer to something entirely different from what we’re discussing here — where a subgroup has some model or estimate fit only on that subgroup’s data exclusively.
srvyr package
srvyr_strat_design <- apistrat %>%
as_survey_design(
strata = stype,
fpc = fpc
)
srvyr_strat_design
## Stratified Independent Sampling design
## Called via srvyr
## Sampling variables:
## - ids: `1`
## - strata: stype
## - fpc: fpc
## Data variables:
## - cds (chr), stype (fct), name (chr), sname (chr), snum (dbl), dname (chr),
## dnum (int), cname (chr), cnum (int), flag (int), pcttest (int), api00
## (int), api99 (int), target (int), growth (int), sch.wide (fct), comp.imp
## (fct), both (fct), awards (fct), meals (int), ell (int), yr.rnd (fct),
## mobility (int), acs.k3 (int), acs.46 (int), acs.core (int), pct.resp (int),
## not.hsg (int), hsg (int), some.col (int), col.grad (int), grad.sch (int),
## avg.ed (dbl), full (int), emer (int), enroll (int), api.stu (int), pw
## (dbl), fpc (dbl)
srvyr_strat_design %>%
summarise(
enroll_total = survey_total(enroll),
enroll_mean = survey_mean(enroll)
) %>%
gt()
| enroll_total | enroll_total_se | enroll_mean | enroll_mean_se |
|---|---|---|---|
| 3687178 | 114641.7 | 595.2821 | 18.50851 |
srvyr_strat_design %>%
group_by(stype) %>%
survey_count()
## # A tibble: 3 × 3
## # Groups: stype [3]
## stype n n_se
## <fct> <dbl> <dbl>
## 1 E 4421. 0
## 2 H 755. 0
## 3 M 1018. 0
Replicate weights exploit sub-sampling to estimate the variability of statistics for which sampling weights alone are not enough. This is particularly useful when estimating a “nonparametric” statistic like the median or a quantile, which doesn’t have an easily derived variance.
For a basic idea of why this works, Lumley notes that one could estimate the variance of a total by using two independent half samples estimating the same total, i.e. if \(\hat{T}_A\) and \(\hat{T}_B\) are both from two independent half samples estimating \(\hat{T}\) then the variance of the difference of the two half samples is proportional to the variance of the original total:
\[ E\left[ (\hat{T}_A - \hat{T}_B)^2 \right] = 2 V[\hat{T}_A] = 4 V[\hat{T}]. \]
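This identity is easy to check by simulation (my own sketch, with a made-up population); the two halves of an SRS from a large population are close enough to independent for the approximation to hold:

```r
# check E[(T_A - T_B)^2] = 4 V[T_hat] for two half samples of an SRS
set.seed(3)
N   <- 1e5
pop <- rexp(N, rate = 1 / 100)              # made-up population values
m   <- 50                                   # size of each half sample

reps <- replicate(4000, {
  s <- sample(pop, 2 * m)
  c(T_A   = N * mean(s[1:m]),               # total from the first half
    T_B   = N * mean(s[(m + 1):(2 * m)]),   # total from the second half
    T_hat = N * mean(s))                    # total from the full sample
})

c(lhs = mean((reps["T_A", ] - reps["T_B", ])^2),
  rhs = 4 * var(reps["T_hat", ]))
```

The two quantities match up to simulation noise, so the squared difference of the half-sample totals really does carry the variance information.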
There are multiple ways one might set up these splits that are more efficient than the straightforward “half” sample described above - Lumley briefly discusses three variants: balanced repeated replication (BRR), the jackknife, and the bootstrap.
Each of these ideas relies on the fundamental idea that we can calculate the variance of our statistic of interest by using — sometimes carefully chosen — subsamples of our original sample to compute the statistic and, more importantly, its variance. Lumley’s use of the equation above gives the basic idea, but I believe the more rigorous justification appeals to theory involving empirical distribution functions, as much of the theory underlying these ideas relies on getting a good estimate of the empirical distribution.
It isn’t explicitly clear which of these techniques is most popular currently, but my guess would be that the bootstrap is the most used. This also happens to be the method for which Lumley provides the most citations in the text. I’ve also run into cases where US Census IPUMS data use successive difference weights.
All this to say that replicate weights are powerful for producing “non-parametric” estimates, like quantiles and so on, and different weighting techniques may be more or less appropriate depending on the design and data involved.
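As a plain-R sketch of the underlying idea (this is not the survey package’s implementation), here is a bootstrap standard error for a median, a statistic with no convenient closed-form variance:

```r
# resample the sample with replacement, recompute the median each time, and
# use the spread of those replicate medians as a standard error
set.seed(4)
samp <- rlnorm(200, meanlog = 11, sdlog = 0.5)   # e.g. skewed incomes (made up)

boot_medians <- replicate(2000, median(sample(samp, replace = TRUE)))
sd(boot_medians)                                 # bootstrap SE of the median
```

Survey replicate weights apply this same logic, except the "resamples" are encoded as weight columns that respect the design.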
Lumley first demonstrates how to setup a survey design object when the weights are already provided. I’ve had trouble accessing the 2005 California Health Interview Survey data on my own but he thankfully provides a link to the data on his website.
chis_adult <- as.data.frame(read_dta("Data/ADULT.dta")) %>%
# have to convert labeled numerics to regular numerics for
# computation in survey package.
mutate(
bmi_p = as.numeric(bmi_p),
srsex = factor(srsex, labels = c("MALE", "FEMALE")),
racehpr = factor(racehpr, labels = c(
"LATINO", "PACIFIC ISLANDER",
"AMERICAN INDIAN/ALASKAN NATIVE",
"ASIAN", "AFRICAN AMERICAN",
"WHITE",
"OTHER SINGLE/MULTIPLE RACE"
))
)
chis <- svrepdesign(
variables = chis_adult[, 1:418],
repweights = chis_adult[, 420:499],
weights = chis_adult[, 419, drop = TRUE],
## combined.weights specifies that the replicate weights
## include the sampling weights
combined.weights = TRUE,
type = "other", scale = 1, rscales = 1
)
chis
## Call: svrepdesign.default(variables = chis_adult[, 1:418], repweights = chis_adult[,
## 420:499], weights = chis_adult[, 419, drop = TRUE], combined.weights = TRUE,
## type = "other", scale = 1, rscales = 1)
## with 80 replicates.
When creating replicate weights in R one specifies a
replicate type to the type argument.
boot_design <- as.svrepdesign(strat_design,
type = "bootstrap",
replicates = 100
)
boot_design
## Call: as.svrepdesign.default(strat_design, type = "bootstrap", replicates = 100)
## Survey bootstrap with 100 replicates.
By default, the as.svrepdesign() function creates jackknife replicate weights.
## jackknife is the default
jk_design <- as.svrepdesign(strat_design)
jk_design
## Call: as.svrepdesign.default(strat_design)
## Stratified cluster jackknife (JKn) with 200 replicates.
Once the design object is created, the mean of a variable can be computed equivalently as before using the svymean() function. We’ll compare the bootstrap and jackknife estimates; note that in this run the bootstrap has a slightly lower standard error than the jackknife.
svymean(~enroll, boot_design)
## mean SE
## enroll 595.28 18.117
svymean(~enroll, jk_design)
## mean SE
## enroll 595.28 18.509
Of course, part of the motivation in using replicate weights is that
you’re able to estimate standard errors for non-trivial estimands,
especially those that may not be implemented in the survey
package. Lumley demonstrates this using a sample from the National Wilms Tumor
Study Cohort, in order to estimate the five year survival
probability via a Kaplan-Meier
Estimator.
library(addhazard)
nwtsco <- as_tibble(nwtsco)
head(nwtsco)
## # A tibble: 6 × 12
## trel tsur relaps dead study stage histol instit age yr specwgt tumdiam
## <dbl> <dbl> <int> <int> <int> <int> <int> <int> <dbl> <int> <int> <int>
## 1 21.9 21.9 0 0 3 1 1 1 2.08 1979 750 14
## 2 11.3 11.3 0 0 3 2 0 0 4.17 1979 590 9
## 3 22.1 22.1 0 0 3 1 1 1 0.75 1979 356 13
## 4 8.02 8.02 0 0 3 2 0 0 2.67 1979 325 9
## 5 20.5 20.5 0 0 3 2 0 0 3.67 1979 970 17
## 6 14.4 14.4 1 1 3 2 0 1 2.58 1979 730 15
cases <- nwtsco %>% filter(relaps == 1)
cases <- cases %>% mutate(wt = 1)
ctrls <- nwtsco %>% filter(relaps == 0)
ctrls <- ctrls %>%
mutate(wt = 10) %>%
sample_n(325)
ntw_sample <- rbind(cases, ctrls)
fivesurv <- function(time, status, w) {
scurve <- survfit(Surv(time, status) ~ 1, weights = w)
## minimum probability that corresponds to a survival time > 5 years
return(scurve$surv[min(which(scurve$time > 5))])
}
des <- svydesign(id = ~1, strata = ~relaps, weights = ~wt, data = ntw_sample)
jkdes <- as.svrepdesign(des)
withReplicates(jkdes, quote(fivesurv(trel, relaps, .weights)))
## theta SE
## [1,] 0.83518 0.0018
The estimated five year survival probability is roughly 84% (0.835, with
standard error 0.0018). The fivesurv function computes the
Kaplan-Meier estimate of five year survival for a given set of times,
statuses, and weights. The withReplicates function then
re-estimates this statistic using each set of replicate weights and
calculates the standard error from the variability of those estimates.
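Stripped of the survey machinery, a replicate-weight standard error is just a scaled measure of spread of the replicate estimates. Below is a minimal base-R sketch of the delete-one jackknife with made-up data and the simple JK1 scaling, as an illustration of the mechanics rather than the stratified JKn scheme the design above uses.

```r
set.seed(1)
y <- rnorm(20, mean = 10)

# Full-sample estimate
theta_hat <- mean(y)

# Delete-one jackknife replicates: re-estimate with each observation removed
theta_reps <- vapply(seq_along(y), function(i) mean(y[-i]), numeric(1))

# JK1 standard error: scaled spread of the replicate estimates
n <- length(y)
jk_se <- sqrt((n - 1) / n * sum((theta_reps - theta_hat)^2))

c(jackknife = jk_se, classical = sd(y) / sqrt(n))
```

For the sample mean, the JK1 standard error reduces exactly to the familiar \(s/\sqrt{n}\), which makes for a useful sanity check.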
It's worth noting that this is the standard error for estimating the five year survival in the NWTS cohort, not in the hypothetical superpopulation of all children with Wilms' tumor.
The same bootstrap replicate design can be constructed with the srvyr package.
boot_design <- as_survey_rep(strat_design,
type = "bootstrap",
replicates = 100
)
boot_design
## Call: Called via srvyr
## Survey bootstrap with 100 replicates.
## Data variables:
## - cds (chr), stype (fct), name (chr), sname (chr), snum (dbl), dname (chr),
## dnum (int), cname (chr), cnum (int), flag (int), pcttest (int), api00
## (int), api99 (int), target (int), growth (int), sch.wide (fct), comp.imp
## (fct), both (fct), awards (fct), meals (int), ell (int), yr.rnd (fct),
## mobility (int), acs.k3 (int), acs.46 (int), acs.core (int), pct.resp (int),
## not.hsg (int), hsg (int), some.col (int), col.grad (int), grad.sch (int),
## avg.ed (dbl), full (int), emer (int), enroll (int), api.stu (int), pw
## (dbl), fpc (dbl)
boot_design %>% summarise(Mean = survey_mean(enroll))
## # A tibble: 1 × 2
## Mean Mean_se
## <dbl> <dbl>
## 1 595. 18.6
It’s not clear or straightforward to me from reading the
srvyr docs how
to estimate the weighted survival function probability — I may return to
this later.
Lumley finishes this section by noting that the bootstrap typically works better when all the strata are large; while a correction is available, it is likely not accurate for small or unequal strata.
Separately, Lumley notes that both the jackknife and the bootstrap can incorporate finite population correction factors.
Finally, the BRR designs implemented in the survey
package will use at most 4 excess replicate splits for \(K < 180\) and at most 5% excess when \(K > 100\). It is not clear to me from
the reading which is more likely to be used for \(100 < K < 180\).
While population means, totals, and differences are typically easy to estimate via the Horvitz-Thompson estimator, other population statistics, such as the median or regression estimates, are more complex. These require using the replicate weights described in the previous section or making certain linearization / interpolation assumptions which may or may not influence the resulting estimate.
Estimation of quantiles involves estimating arbitrary points along the
cumulative distribution function (cdf). For example, the 90th percentile has 90%
of the estimated population size below it and 10% above. In this case,
for cdf \(F_X(x)\), we want to estimate
\(x: F_X(x) = 0.9\). However,
estimating the cdf presents some technical difficulties: the empirical
cumulative distribution function (ecdf) is a step function rather than a
"smooth" estimate, and the locations of its jumps are highly dependent
upon the sample. Consequently, Lumley's
function svyquantile() interpolates linearly between two
adjacent observations when the quantile is not uniquely defined.
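To make the interpolation concrete, here is a base-R sketch of a weighted cdf inverted at a target probability, interpolating linearly between the two observations that bracket it. This is my own illustration of the idea, not svyquantile()'s exact interpolation rule.

```r
weighted_quantile <- function(x, w, p) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  # Estimated population cdf at each observed value
  Fhat <- cumsum(w) / sum(w)
  # First observation at or above the target probability
  hi <- which(Fhat >= p)[1]
  if (hi == 1 || Fhat[hi] == p) return(x[hi])
  lo <- hi - 1
  # Linear interpolation between the bracketing observations
  x[lo] + (x[hi] - x[lo]) * (p - Fhat[lo]) / (Fhat[hi] - Fhat[lo])
}

# With equal weights, the median falls exactly on an ecdf jump here
weighted_quantile(c(1, 2, 3, 10), w = rep(1, 4), p = 0.5)
```

With unequal weights the cdf steps are uneven, which is exactly when the interpolation between bracketing observations starts to matter.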
samp <- rnorm(20)
plot(ecdf(samp))
Empirical Cumulative Distribution Function - note the jumps at distinctive points along the x-axis.
Confidence intervals are constructed similarly, using the ecdf, though it should be noted that estimating extreme quantiles poses difficulties because of the low density values in the area.
A first calculation demonstrates this using replicate weights with the California Health Interview Survey, estimating different quantiles of BMI.
svyquantile(~bmi_p, design = chis, quantiles = c(0.25, 0.5, 0.75))
## $bmi_p
## quantile ci.2.5 ci.97.5 se
## 0.25 22.68 22.66 22.81 0.03767982
## 0.5 25.75 25.69 25.80 0.02763161
## 0.75 29.18 29.12 29.29 0.04270393
##
## attr(,"hasci")
## [1] TRUE
## attr(,"class")
## [1] "newsvyquantile"
The same thing can be done with the stratified design. Here the uncertainty is computed by estimating the ecdf and finding pointwise confidence intervals at different points along the curve.
svyquantile(~api00,
design = strat_design, quantiles = c(0.25, 0.5, 0.75),
ci = TRUE
)
## $api00
## quantile ci.2.5 ci.97.5 se
## 0.25 565 535 597 15.71945
## 0.5 668 642 694 13.18406
## 0.75 756 726 778 13.18406
##
## attr(,"hasci")
## [1] TRUE
## attr(,"class")
## [1] "newsvyquantile"
You can see how to construct the same estimate below using the
srvyr package.
srvyr_strat_design %>%
summarize(quantiles = survey_quantile(api00, quantiles = c(0.25, 0.5, 0.75)))
## # A tibble: 1 × 6
## quantiles_q25 quantiles_q50 quantiles_q75 quantiles_q25_se quantiles_q50_se
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 565 668 756 15.7 13.2
## # ℹ 1 more variable: quantiles_q75_se <dbl>
Lumley’s main points in this section focus on the complications in interpretation of typical contingency table tests of association in a design based setting. Specifically, he points out that it is not obvious how the null distribution should be generated without making some kind of modeling assumptions. Quoting from the book (text in parentheses from me):
For example, if there are 56,181,887 women and 62,710,561 men in a population it is not possible for the proportions of men and women who are unemployed to be the same, since these population sizes have no common factors. We would know without collecting any employment data that the finite-population null hypothesis (of equal proportions) was false. A more interesting question is whether the finite population could have arisen from a process that had no association between the variables: is the difference at the population level small enough to have arisen by chance…. A simpler approach is to treat the sample as if it came from an infinite superpopulation and simply ignore the finite-population corrections in inference.
The super-population approach is the more philosophically interesting
of the two and is the one implemented in the survey
package. The svychisq function implements a test with no
association as the null, using a chi-squared distribution with a
correction for the mean and variance. Lumley discusses various methods
for computing the \(\chi^2\) statistic
in this setting and their implementations (see the statistic
argument of svychisq()). I'd suggest looking at the function
documentation if that level of detail is needed.
Lumley demonstrates how to call these functions estimating the proportion of smokers in each insurance status group from the California Health Interview Survey.
tab <- svymean(~ interaction(ins, smoking, drop = TRUE), chis)
tab
## mean SE
## interaction(ins, smoking, drop = TRUE)1.1 0.112159 0.0021
## interaction(ins, smoking, drop = TRUE)2.1 0.039402 0.0015
## interaction(ins, smoking, drop = TRUE)1.2 0.218909 0.0026
## interaction(ins, smoking, drop = TRUE)2.2 0.026470 0.0012
## interaction(ins, smoking, drop = TRUE)1.3 0.507728 0.0036
## interaction(ins, smoking, drop = TRUE)2.3 0.095332 0.0022
ftab <- ftable(tab, rownames = list(
ins = c("Insured", "Uninsured"),
smoking = c("Current", "Former", "Never")
))
round(ftab * 100, 1)
## ins Insured Uninsured
## smoking
## Current mean 11.2 3.9
## SE 0.2 0.1
## Former mean 21.9 2.6
## SE 0.3 0.1
## Never mean 50.8 9.5
## SE 0.4 0.2
In the output below we see a very small p-value indicating that the data were unlikely to be generated from a process in which smoking and insurance status were independent.
svychisq(~ smoking + ins, chis)
##
## Pearson's X^2: Rao & Scott adjustment
##
## data: svychisq(~smoking + ins, chis)
## F = 130.11, ndf = 1.9923, ddf = 157.3884, p-value < 2.2e-16
Estimation within subpopulations (also called domain estimation) is easy when the subpopulation is a sampled stratum, since a stratified sample is composed of random samples within strata by definition; simply compute the desired statistic using the stratum-specific random sample.
When the subpopulation of interest is not a stratum, things are more difficult. The sampling weights would still be correct for projecting any given observation to the total population, so the point estimate of the mean remains unbiased, but the co-occurrence probabilities \(\pi_{i,j}\) would be incorrect, because they are now unknown/random rather than fixed by design. The two usual approaches for estimating subpopulation variances in spite of these difficulties are linearization and replicate weighting. The computation is straightforward for replicate weighting, since the non-subpopulation entities can simply be discarded. For linearization the computation is less straightforward, as the extra entities still have to be included as "0"s in the final computation (this is according to Lumley's arguments in his Appendix).
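The point-estimate side of this is easy to see in base R: the weighted total for a domain is the same whether you drop the out-of-domain units or keep them with the variable zeroed out; the difference between the approaches only shows up in the variance. A sketch with made-up data and weights:

```r
set.seed(2)
y <- rexp(50)
w <- runif(50, 1, 10)              # sampling weights (made up)
in_domain <- rbinom(50, 1, 0.4) == 1

# Drop the out-of-domain units...
t_drop <- sum(w[in_domain] * y[in_domain])

# ...or keep every unit, zeroing y outside the domain
t_zero <- sum(w * y * in_domain)

all.equal(t_drop, t_zero)
```

Both give identical domain totals; the zeroed version matters because linearization needs the full sample structure to get the variance right.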
The examples below demonstrate this idea by estimating scores and
enrollment among California schools with many (emer > 20)
and no (emer == 0) teachers with emergency training, using
the api dataset.
emerg_high <- subset(strat_design, emer > 20)
emerg_low <- subset(strat_design, emer == 0)
svymean(~ api00 + api99, emerg_high)
## mean SE
## api00 558.52 21.708
## api99 523.99 21.584
svymean(~api00, emerg_low)
## mean SE
## api00 749.09 17.516
svytotal(~enroll, emerg_high)
## total SE
## enroll 762132 128674
svytotal(~enroll, emerg_low)
## total SE
## enroll 461690 75813
In general, if replicate weights are available, domain estimation is much easier.
bys <- svyby(~bmi_p, ~ srsex + racehpr, svymean,
design = chis,
keep.names = FALSE
)
print(bys, digits = 3)
## srsex racehpr bmi_p se
## 1 MALE LATINO 28.2 0.1447
## 2 FEMALE LATINO 27.5 0.1443
## 3 MALE PACIFIC ISLANDER 29.7 0.7055
## 4 FEMALE PACIFIC ISLANDER 27.8 0.9746
## 5 MALE AMERICAN INDIAN/ALASKAN NATIVE 28.8 0.5461
## 6 FEMALE AMERICAN INDIAN/ALASKAN NATIVE 27.0 0.4212
## 7 MALE ASIAN 24.9 0.1406
## 8 FEMALE ASIAN 23.0 0.1112
## 9 MALE AFRICAN AMERICAN 28.0 0.2663
## 10 FEMALE AFRICAN AMERICAN 28.4 0.2417
## 11 MALE WHITE 27.0 0.0598
## 12 FEMALE WHITE 25.6 0.0680
## 13 MALE OTHER SINGLE/MULTIPLE RACE 26.9 0.3742
## 14 FEMALE OTHER SINGLE/MULTIPLE RACE 26.7 0.3158
# This is the code from the book but it didn't work for me
# because of issues in the survey R package, I reproduce the
# first result using the srvyr package below
# medians <- svyby(~bmi_p, ~ srsex + racehpr, svyquantile,
# design = chis,
# covmat = TRUE,
# quantiles = 0.5
# )
# svycontrast(medians, quote(MALE.LATINO/FEMALE.LATINO))
medians <- chis %>%
as_survey() %>%
group_by(srsex, racehpr) %>%
summarize(Median_BMI = survey_median(bmi_p, vartype = "ci"))
medians
## # A tibble: 14 × 5
## # Groups: srsex [2]
## srsex racehpr Median_BMI Median_BMI_low Median_BMI_upp
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 MALE LATINO 27.4 27.3 27.8
## 2 MALE PACIFIC ISLANDER 28.9 27.8 30.7
## 3 MALE AMERICAN INDIAN/ALASKAN NATI… 28.4 26.9 29.3
## 4 MALE ASIAN 24.4 24.2 24.8
## 5 MALE AFRICAN AMERICAN 27.4 26.6 28.1
## 6 MALE WHITE 26.4 26.3 26.5
## 7 MALE OTHER SINGLE/MULTIPLE RACE 26.6 25.9 27.5
## 8 FEMALE LATINO 26.3 25.8 26.5
## 9 FEMALE PACIFIC ISLANDER 27.3 25.6 28.3
## 10 FEMALE AMERICAN INDIAN/ALASKAN NATI… 25.1 24.5 26.5
## 11 FEMALE ASIAN 22.1 22.0 22.4
## 12 FEMALE AFRICAN AMERICAN 27.2 26.6 27.5
## 13 FEMALE WHITE 24.3 24.2 24.4
## 14 FEMALE OTHER SINGLE/MULTIPLE RACE 25.7 25.1 26.5
How should we pick the sample size for each stratum? Well, it depends on the goals of the analysis. If the goal is to estimate a total across the whole population, the formula for the variance of a total can be used to reason about optimal allocation. Since the variance of the total is the sum of the stratum-specific variances, more sample should be dedicated to more heterogeneous and/or larger strata.
This general approach means that the sample size for strata \(k\), \(n_k\) should be proportional to the population strata size \(N_k\) and strata variance \(\sigma^{2}_k\), \(n_k \propto \sqrt{N^2_k \sigma^2_k} = N_k \sigma_k\). Lumley notes that while this expression satisfies some theoretical optimality criteria, it is often the case that different strata have different costs associated with their sampling and so the expression can be modified in order to take into account this cost as follows:
\[ n_k \propto \frac{ N_k \sigma_k}{\sqrt{\text{cost}_k}}, \]
where cost\(_k\) is the cost of sampling for strata \(k\).
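This allocation rule is a one-liner in R. A sketch, with the stratum sizes, standard deviations, and per-unit costs all made up for illustration:

```r
# Allocate a total sample of size n across strata with
# n_k proportional to N_k * sigma_k / sqrt(cost_k)
allocate <- function(n, N, sigma, cost = rep(1, length(N))) {
  share <- N * sigma / sqrt(cost)
  round(n * share / sum(share))
}

N     <- c(900, 9000)  # stratum population sizes (made up)
sigma <- c(30, 10)     # stratum standard deviations (made up)

allocate(200, N, sigma)                  # cost-free (Neyman) allocation
allocate(200, N, sigma, cost = c(1, 4))  # say stratum 1 is 4x cheaper to sample
```

Note how raising a stratum's cost shifts sample away from it, even when its size and variability are unchanged.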
1. You are conducting a survey of emergency preparedness at a large HMO, where you want to estimate what proportion of medical staff would be able to get to work after an earthquake.
a.) You can either send out a single questionnaire to all staff, or send out a questionnaire to about 10% of the staff and make follow-up phone calls for those that don’t respond. What are the disadvantages of each approach?
This comes down to the cost of sampling and what missing data mechanism may be at play. As a simple starting point, if we were to assume the resulting data were MCAR and the non-response rate was equivalent between both sampling strategies, the single questionnaire would be preferred because it would result in a higher overall sample size. These assumptions are probably not realistic, however; we may expect that non-response is associated with other meaningful factors, and by choosing the follow-up phone calls we might minimize non-response, both reducing bias and improving precision.
Additional relevant concerns would be the possible response or lack of response of certain strata — certain physicians, technicians or other kinds of staff’s response would likely be worth knowing yet these groups may be less well represented in a 10% simple random sample of the population.
b.) You choose to survey just a sample. What would be useful variables to stratify the sampling, and why?
The aforementioned job title would be useful to stratify on. This would likely be most useful to conduct within each department. Further, if the HMO has more than one site or clinic, that would be worth stratifying on as well for substantive reasons just as much as statistical reasons.
c.) The survey was conducted with just two strata: physicians and other staff. The HMO has 900 physicians and 9000 other staff. You sample 450 physicians and 450 other staff. What are the sampling probabilities in each stratum?
The physician stratum sampling probabilities are \(\frac{n}{N_k} = \frac{450}{900} = \frac{1}{2}\), while the "other staff" probabilities are \(\frac{450}{9000} = \frac{1}{20}\).
d.) 300 physicians and 150 other staff say they would be able to get to work after an earthquake. Give unbiased estimates of the proportion in each stratum and the total proportion.
The physician stratum estimate would be \(\frac{300}{450} = \frac{2}{3}\), while the staff
stratum estimate would be \(\frac{150}{450} =
\frac{1}{3}\). The total proportion would be \(\frac{2 \times 300 + 20 \times
150}{9900}\). This value can be recreated below with the
survey package as follows.
df <- tibble(
id = 1:900,
job = c(rep("MD", 450), rep("staff", 450)),
prep = c(rep(1, 300), rep(0, 150), rep(1, 150), rep(0, 300)),
weights = c(rep(2, 450), rep(20, 450))
)
hmo_design <- svydesign(strata = ~job, ids = ~0, weights = ~weights, data = df)
hmo_design
## Stratified Independent Sampling design (with replacement)
## svydesign(strata = ~job, ids = ~0, weights = ~weights, data = df)
svymean(~prep, hmo_design)
## mean SE
## prep 0.36364 0.0203
e.) How would you explain to the managers that commissioned the study how the estimate was computed and why it wasn’t just the number who said “yes” divided by the total number surveyed?
We sampled from the total population using strata because we thought these two groups would respond differently, and indeed they did: physicians are twice as likely to be able to make it to the hospital in the event of an emergency as other staff. However, physicians make up a much smaller proportion of the overall hospital workforce, so we need to down-weight their responses relative to other staff to reflect their share of the total population. The total estimate of the HMO's emergency preparedness is therefore much closer to the "other" staff's stratum estimate of \(\frac{1}{3}\).
2. You are conducting a survey of commuting time and means of transport for a large university. What variables are likely to be available and useful for stratifying the sampling?
Probably worth stratifying on "role" at the university: student vs. staff vs. professor. Each of these has different income available, which would likely determine their means of getting to campus and, consequently, their commute time. It might also be worth stratifying on department of employment for staff and professors, as there can be wide variability in these measures by department.
3.-4. Skip because of CHIS data issues
mobility, emer, meals,
pcttest? Compare your expectations with the actual
results.

The general principle here is: which of these variables do we expect to have some association with the school type? The greater the association, the greater the benefit from stratifying.
meals, mean ell.

A first point worth noting is that school size is an integer valued variable, so some grouping will have to be created to define the strata from which schools are then drawn. One possible option is to define the strata as above and below the median school enrollment. Since this divides the population exactly in half, the strata are equally sized and the only differentiating factor is the variability of the enrollment.
as_tibble(apipop) %>%
transmute(
enroll = enroll,
strat = if_else(enroll > median(enroll, na.rm = TRUE), "Above", "Below")
) %>%
group_by(strat) %>%
summarize(sd_enroll = sd(enroll))
## # A tibble: 3 × 2
## strat sd_enroll
## <chr> <dbl>
## 1 Above 504.
## 2 Below 89.8
## 3 <NA> NA
Since the variability of enrollment in the above-median-size schools is roughly five times that in the below-median-size schools, we'd sample from the two strata at roughly that ratio. Explicitly, we'd sample 160 schools from the above-median stratum and 40 schools from the below-median stratum.
api_opt_strat <- as_tibble(apipop) %>%
filter(!is.na(enroll)) %>%
transmute(
enroll = enroll,
strat = if_else(enroll > median(enroll, na.rm = TRUE), "Above", "Below")
)
above <- api_opt_strat %>%
filter(strat == "Above") %>%
mutate(fpc = n()) %>%
slice_sample(n = 160)
below <- api_opt_strat %>%
filter(strat == "Below") %>%
mutate(fpc = n()) %>%
slice_sample(n = 40)
opt_strat_design <- svydesign(
ids = ~1, strata = ~strat, fpc = ~fpc,
data = rbind(above, below)
)
svytotal(~enroll, opt_strat_design)
## total SE
## enroll 3889659 112859
The estimated total is slightly closer to the true value, 3811472, but the standard error is slightly larger (119K vs 114K) compared to the previous stratified design. However, that design sampled 1000s of schools per stratum, so this one is remarkably efficient.
I won’t perform the other comparisons, but one would expect this design to be much less efficient at estimating the other variables if they are not well correlated with enrollment size.
I’ve already done something like this at the start of the chapter. We’d expect the central limit theorem (the normality of the sample means) to be better for the larger sample sizes listed. Or rather, we might not trust the standard error for the sample size of 10 to really represent the variability of the sample mean.
pop_median <- median(apipop$enroll)
p1 <- apipop %>%
ggplot(aes(x = enroll)) +
geom_histogram() +
xlab("Student Enrollment") +
geom_vline(xintercept = pop_median, linetype = 2, color = "red") +
ggtitle("Distribution of School Enrollment",
subtitle = "Median Enrollment Identified"
)
p <- replicate(n = 500, {
apipop %>%
sample_n(200) %>%
pull(enroll) %>%
median(., na.rm = TRUE)
})
p <- tibble(sample_ix = 1:500, sample_median = p) %>%
ggplot(aes(x = sample_median)) +
geom_histogram() +
xlab("Student Enrollment Medians") +
geom_vline(
xintercept = pop_median,
linetype = 2, color = "red"
) +
ggtitle("Distribution of Sample Medians")
p1 + p
The same result holds since the samples are independent within and across strata as well.
ns <- c(100, 52, 24, 12)
simdf <- map_dfr(1:200, function(sim_ix) {
map_dfr(ns, function(n) {
apipop %>%
filter(stype == "E") %>%
mutate(fpc = n()) %>%
slice_sample(n = n * .5) %>%
rbind(
.,
apipop %>%
filter(stype != "E") %>%
group_by(stype) %>%
mutate(fpc = n()) %>%
slice_sample(n = n * .25)
) %>%
as_survey_design(strata = stype, fpc = fpc) %>%
summarize(mean_enroll = survey_mean(enroll, na.rm = TRUE)) %>%
mutate(sim_ix = sim_ix, n = n)
})
})
simdf %>%
ggplot(aes(x = mean_enroll, fill = factor(n))) +
geom_histogram() +
theme(legend.title = element_blank()) +
ggtitle("CLT for stratified samples of different sizes.")
\[ \hat{V}[\hat{T}] = V_1 + V_2 \\ V_1 = \frac{N_1 - n_1}{N_1} N_1^2 \frac{\sigma^2_1}{n_1} \\ V_2 = \frac{N_2 - (n - n_1)}{N_2} N_2^2 \frac{\sigma^2_2}{n - n_1} \] taking the derivative …
\[ \frac{d\hat{V}[\hat{T}]}{dn_1} = \frac{dV_1}{dn_1} + \frac{dV_2}{dn_1} = \frac{-N_1^2 \sigma^2_1}{n_1^2} + \frac{N_2^2 \sigma_2^2}{(n - n_1)^2} \] setting to zero and solving for \(n_1\) we get \[ \frac{-N_1^2 \sigma^2_1}{n_1^2} + \frac{N_2^2 \sigma_2^2}{(n - n_1)^2} = 0 \\ \iff \\ \frac{N_2^2\sigma^2_2}{(n-n_1)^2} = \frac{N_1^2\sigma_1^2}{n_1^2}\\ \iff \\ \frac{n_1^2}{(n-n_1)^2} = \frac{N_1^2\sigma_1^2}{N_2^2\sigma^2_2} \\ \iff \\ \frac{n_1}{n-n_1} = \frac{N_1\sigma_1}{N_2\sigma_2} \\ \iff \\ n_1 = \frac{nN_1\sigma_1}{N_2\sigma_2\left(1 + \frac{N_1\sigma_1}{N_2\sigma_2}\right)} = \frac{nN_1\sigma_1}{N_2\sigma_2 + N_1\sigma_1} \]
Here we take the positive square root, since the sample sizes and population quantities are constrained to be positive, so only the positive root solves the equation.
Also, \(N_1, N_2\) are taken to be the population strata sizes and \(\sigma_1, \sigma_2\) are the strata standard deviations.
We can check the second derivative to ensure this is a global optimum, or use the fact that the variance is a convex function of \(n_1\); consequently, there is a single global optimum.
Examining the expression, we see that each stratum's optimal share of the sample is proportional to \(N_k \sigma_k\), its population size times its standard deviation, as a fraction of the same quantity summed across all strata.
Here we've derived the solution for the "first" stratum, but the labeling is arbitrary; a similar expression holds for any stratum, in a design with any number of strata.
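As a sanity check on this algebra, the two-stratum variance can be minimized numerically and compared against the closed form. This is a quick base-R check of my own, with all the population values made up:

```r
# Variance of the stratified total as a function of n1, with n2 = n - n1
strat_var <- function(n1, n, N1, N2, s1, s2) {
  v <- function(nk, Nk, sk) (Nk - nk) / Nk * Nk^2 * sk^2 / nk
  v(n1, N1, s1) + v(n - n1, N2, s2)
}

n <- 200; N1 <- 1000; N2 <- 4000; s1 <- 50; s2 <- 10

# Closed-form optimum: n1 = n * N1 * s1 / (N1 * s1 + N2 * s2)
closed_form <- n * N1 * s1 / (N1 * s1 + N2 * s2)

# Numerical minimization over n1 in (1, n - 1)
numeric_opt <- optimize(strat_var, c(1, n - 1),
                        n = n, N1 = N1, N2 = N2, s1 = s1, s2 = s2)$minimum

c(closed_form = closed_form, numeric = numeric_opt)
```

The two agree to within optimize()'s tolerance, which is reassuring given how easy it is to drop a factor in the derivative.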
strat_var_sample <- function(n_1, n_2, N_1, N_2, sigma_1, sigma_2) {
one_strat_var <- function(n, N, sigma) {
(N - n) / N * N^2 * sigma^2 / n
}
return(one_strat_var(n_1, N_1, sigma_1) + one_strat_var(n_2, N_2, sigma_2))
}
expand.grid(
n_1 = seq(2, 100, 10), n_2 = seq(2, 100, 10),
N_1 = 1E3, N_2 = 1E3, sigma_1 = 1, sigma_2 = 1
) %>%
as_tibble() %>%
mutate(
var = map2_dbl(n_1, n_2, strat_var_sample, unique(N_1), unique(N_2), unique(sigma_1), unique(sigma_2)),
n_2 = factor(n_2)
) %>%
ggplot(aes(x = n_1, y = var, color = n_2)) +
geom_point() +
geom_line() +
ylab("Variance") +
ggtitle("Two Strata Design Variance", subtitle = "For Varying Strata Sizes")
I think the point Lumley is after is that the variance doesn't change dramatically once each sample size is roughly more than ten. Formally, the minimum variance is attained at the largest \(n_1\) and \(n_2\), since more sample always lowers the variance; but the steepest efficiency gains occur before the two sample sizes reach about ten, after which the curves flatten out.
\[ \pi_{ij} = \frac{\binom{N-2}{n-2}}{\binom{N}{n}} = \frac{(N-2)!}{(n-2)!\,(N-n)!} \cdot \frac{n!\,(N-n)!}{N!} = \frac{n(n-1)}{N(N-1)} \] The numerator counts the samples of size \(n\) that contain both unit \(i\) and unit \(j\); dividing by the total number of possible samples gives the pairwise inclusion probability.
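The pairwise inclusion probability \(n(n-1)/(N(N-1))\) can also be verified by brute force for a small population, enumerating every possible simple random sample:

```r
N <- 6
n <- 3
samples <- combn(N, n)  # every possible simple random sample of size n

# Fraction of samples containing both unit 1 and unit 2
both_in <- apply(samples, 2, function(s) all(c(1, 2) %in% s))
pi_12 <- mean(both_in)

c(enumerated = pi_12, theory = n * (n - 1) / (N * (N - 1)))
```

Since every sample is equally likely under SRS, the fraction of enumerated samples containing both units is exactly \(\pi_{12}\).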
Compute \(\pi_{ij} - \pi_i \pi_j\)
Show that the equation in exercise 1.10 (c) reduces to equation 2.2
Suppose instead each individual in the population is independently sampled with probability \(\frac{n}{N}\), so that the sample size \(n\) is not fixed. Show that the finite population correction disappears from equation 2.2 for this Bernoulli sampling design.
Why sample clusters? Because sometimes it’s easier than sampling individuals. Specifically, in cases where the cost of sampling individuals can be quite high, sampling clusters can be more efficient. This is in spite of the fact that within cluster correlation tends to be positive, reducing the information in the sample. Lumley uses the NHANES survey to motivate this idea: moving mobile examination centers all across the country to sample individuals is extremely expensive. By sampling a large number of individuals within a census tract aggregation area the NHANES survey is able to reduce the cost of their effort at a reasonable expense in precision.
Depending on the type of clusters involved it can be easy to sample an entire cluster, as with classrooms, medical practices, and workplaces; more often, though, some subsampling within clusters will be performed for the sake of efficiency. As Lumley notes, clusters in the first stage are called Primary Sampling Units or PSUs. "Stages" refer to the different levels at which sampling occurs. E.g., sampling individuals within sampled census tracts within a state would involve sampling census tracts in the first stage and then individuals in the second stage. The diagram below communicates this idea graphically.
Sampling weights are determined assuming independence across stages. For example, if a cluster of houses is sampled with probability \(\pi_1\) and a household is sampled within that cluster with probability \(\pi_2\), then the sampling probability for that household is \(\pi = \pi_1 \times \pi_2\) and its weight is the inverse of that probability. Note that this requires that clusters be mutually exclusive: a sampled unit can belong to one cluster and no others. Further, note that we can still have non-uniform sampling within a stage, as independence is only required across stages to find probabilities via their product.
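In code, the weight construction is just a product across stages. A toy sketch with made-up cluster counts and subsample sizes:

```r
# Stage 1: sample 2 of 10 clusters of houses
pi1 <- 2 / 10

# Stage 2: sample 5 of the 50 households within a sampled cluster
pi2 <- 5 / 50

# Overall inclusion probability, by cross-stage independence
pi_household <- pi1 * pi2

# Sampling weight: each sampled household stands in for 1 / pi households
w <- 1 / pi_household
w
```

Here each sampled household represents 50 households in the population, which is exactly the inverse of its two-stage inclusion probability.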
Lumley goes on to describe how cluster sampling and individual sampling can be mixed. Since each stratum of a survey can be thought of as a separate and independent sample, it is trivial to combine single stage sampling in one stratum and multistage sampling in another: a stratified random sample can be used in high density regions where measurement of multiple units is less costly, and a cluster sample can be taken in low density regions where each additional unit is more costly.
The statistical rationale behind this strategy is fairly straightforward — since the variance of the sum is the sum of the variances of each stage (assuming independence) each sampled cluster in a multistage sample can be considered as a population for further sampling. Lumley uses the example of a simplified NHANES design, where 64 regions are grouped into 32 strata. A simple random sample of 440 individuals are then measured in each region. In Lumley’s words,
The variance of an estimated total from this design can be partitioned across two sources: the variance of each estimated regional total around the true total of the region and the variance that would result if the true total for each of the 64 sampled regions were known exactly.
In my own words: there are two sources of uncertainty, one across regions, arising from which regions were sampled within the strata, and one within each sampled region, arising from measuring only a subset of its population.
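This partition is the law of total variance applied across the two stages. A small simulation with an artificial clustered population (all of the numbers below are made up) shows the resulting two-stage estimator recovering the population total, with variability coming from both stages:

```r
set.seed(3)

# Artificial population: 20 clusters of 100 units each
mu_c <- rnorm(20, mean = 50, sd = 10)          # cluster means
pop  <- lapply(mu_c, function(m) rnorm(100, m, 5))

est_total <- function() {
  cl <- sample(20, 4)  # stage 1: SRS of 4 of the 20 clusters
  # stage 2: SRS of 10 units per sampled cluster, scaled to a cluster total
  cluster_totals <- sapply(pop[cl], function(y) mean(sample(y, 10)) * 100)
  20 / 4 * sum(cluster_totals)  # scale sampled clusters up to all clusters
}

tot_hats <- replicate(2000, est_total())
c(mean_estimate = mean(tot_hats), true_total = sum(unlist(pop)))
```

Averaged over many repetitions the estimator centers on the true total; its spread combines the between-cluster (stage 1) and within-cluster (stage 2) variance components.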
In order to specify a single stage cluster sample or a multistage
sample treated as a single stage sample with replacement, the main
difference is that the PSU identifier needs to be supplied to the
id argument, as follows.
# Data originally found at
# "https://github.com/cran/LogisticDx/blob/master/data/nhanes3.rda"
names3 <- load("Data/nhanes/nhanes3.rda")
as_tibble(nhanes3)
## # A tibble: 17,030 × 16
## SEQN SDPPSU6 SDPSTRA6 WTPFHX6 HSAGEIR HSSEX DMARACER BMPWTLBS BMPHTIN
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 3 1 44 1735. 21 male white 180. 70.4
## 2 4 1 43 1725. 32 female white 136. 63.9
## 3 9 2 43 19452. 48 female white 150. 61.8
## 4 10 1 6 27770. 35 male white 204. 69.8
## 5 11 2 40 1246. 48 male white 155. 66.2
## 6 19 1 35 3861. 44 male black 190. 70.2
## 7 34 1 13 5032. 42 female black 126. 62.6
## 8 44 1 8 28149. 24 female white 123. 64.4
## 9 45 1 22 4582. 67 female black 150. 64.3
## 10 48 1 24 26919. 56 female white 240. 67.6
## # ℹ 17,020 more rows
## # ℹ 7 more variables: PEPMNK1R <dbl>, PEPMNK5R <dbl>, HAR1 <fct>, HAR3 <fct>,
## # SMOKE <fct>, TCP <dbl>, HBP <fct>
svydesign(
id = ~SDPPSU6, strat = ~SDPSTRA6,
weight = ~WTPFHX6,
## nest = TRUE indicates the PSU identifier is nested
## within stratum - repeated across strata
nest = TRUE,
data = nhanes3
)
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (98) clusters.
## svydesign(id = ~SDPPSU6, strat = ~SDPSTRA6, weight = ~WTPFHX6,
## nest = TRUE, data = nhanes3)
SDPPSU6 is the pseudo PSU variable, and SDPSTRA6 is the stratum identifier defined for the single stage analysis.
For example, consider a two stage design for the API population that samples 40 school districts and then five schools within each district. The population size at the first stage is 757, the number of school districts in CA; at the second stage it is the number of schools within each sampled district. The weights need not be supplied if they can be worked out from the other arguments.
data(api)
as_tibble(apiclus2)
## # A tibble: 126 × 40
## cds stype name sname snum dname dnum cname cnum flag pcttest api00
## <chr> <fct> <chr> <chr> <dbl> <chr> <int> <chr> <int> <int> <int> <int>
## 1 31667796… E Alta… Alta… 3269 Alta… 15 Plac… 30 NA 100 821
## 2 55751846… E Tena… Tena… 5979 Big … 63 Tuol… 54 NA 100 773
## 3 41688746… E Pano… Pano… 4958 Bris… 83 San … 40 NA 98 600
## 4 41688746… M Lipm… Lipm… 4957 Bris… 83 San … 40 NA 100 740
## 5 41688746… E Bris… Bris… 4956 Bris… 83 San … 40 NA 98 716
## 6 40687266… E Cayu… Cayu… 4915 Cayu… 117 San … 39 NA 100 811
## 7 20651936… E Full… Full… 2548 Chow… 132 Made… 19 NA 100 472
## 8 20651936… E Fair… Fair… 2550 Chow… 132 Made… 19 NA 100 520
## 9 20651936… M Wils… Wils… 2549 Chow… 132 Made… 19 NA 100 568
## 10 06615980… H Colu… Colu… 348 Colu… 152 Colu… 5 NA 96 591
## # ℹ 116 more rows
## # ℹ 28 more variables: api99 <int>, target <int>, growth <int>, sch.wide <fct>,
## # comp.imp <fct>, both <fct>, awards <fct>, meals <int>, ell <int>,
## # yr.rnd <fct>, mobility <int>, acs.k3 <int>, acs.46 <int>, acs.core <int>,
## # pct.resp <int>, not.hsg <int>, hsg <int>, some.col <int>, col.grad <int>,
## # grad.sch <int>, avg.ed <dbl>, full <int>, emer <int>, enroll <int>,
## # api.stu <int>, pw <dbl>, fpc1 <dbl>, fpc2 <int[1d]>
## dnum = district id
## snum = school id
## fpc1 = number of districts in the population; fpc2 = number of schools in the district
clus1_design <- svydesign(id = ~dnum, fpc = ~fpc, data = apiclus1)
clus2_design <- svydesign(
id = ~ dnum + snum, fpc = ~ fpc1 + fpc2,
data = apiclus2
)
clus2_design
## 2 - level Cluster Sampling design
## With (40, 126) clusters.
## svydesign(id = ~dnum + snum, fpc = ~fpc1 + fpc2, data = apiclus2)
When only one PSU is sampled within a population stratum, the first
stage variance cannot be estimated from that stratum alone. If the PSU
was sampled with certainty (a 100% sampling fraction), the stratum
genuinely contributes nothing to the first stage variance and should be
ignored in that calculation. Otherwise, Lumley argues
that the best way to handle a stratum with only one PSU is to combine it
with another stratum, one that is chosen to be similar based on
population data available before the study was done. The
survey package has two different methods implemented to
handle "lonely" PSUs. Lumley has written further on this topic here.
Here Lumley walks through an example detailing the trade-offs involved in using the single stage approximation. I’ll try to come up with a simulated example later as the data is not listed on the book’s website nor is it clear how to reassemble his dataset from the files at the NHIS site.
Why do white sheep eat more than black sheep? There are more white sheep than black sheep
Probability-proportional-to-size (PPS) cluster sampling is a strategy that exploits the following fact: if, in an unstratified population, the inclusion probabilities \(\pi_i\) are chosen approximately proportional to \(X_i\), the variable of interest, then the variance of the estimated total, which for a simple random sample is \(V[\hat{T}] = \frac{N-n}{N} N^{2} \frac{V[X]}{n}\), can be made quite small. These are the same ideas I discussed at the start of the notes; more discussion on this topic can be found in (Tillé 2006; Hanif and Brewer 1980).
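To see why choosing \(\pi_i \propto X_i\) helps, consider the Horvitz–Thompson estimator of the total. If the proportionality were exact, every sampled term would be identical and the estimator would have zero variance:

\[
\hat{T} = \sum_{i \in s} \frac{X_i}{\pi_i}, \qquad \pi_i = \frac{n X_i}{T} \;\implies\; \hat{T} = \sum_{i \in s} \frac{T}{n} = T.
\]

In practice \(X_i\) is not known before sampling, so \(\pi_i\) is made proportional to an auxiliary size measure (here, the number of votes cast), and the variance is small to the extent that the size measure tracks \(X_i\).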
data(election)
election <- as_tibble(election) %>%
mutate(
votes = Bush + Kerry + Nader,
p = 40 * votes / sum(votes)
)
election %>%
ggplot(aes(x = Kerry, y = Bush)) +
geom_point() +
scale_y_log10() +
scale_x_log10() +
ggtitle("Correlation in Voting Totals from US 2004 Presidential Election",
subtitle = "Both x and y axes are on log 10 scales."
)
When Lumley’s book was written, only the single stage approximation
of PPS could be analyzed using the survey package. A demo
is shown below using the voting data, where a PPS sample is constructed
and then analyzed.
election
## # A tibble: 4,600 × 8
## County TotPrecincts PrecinctsReporting Bush Kerry Nader votes p
## <fct> <int> <int> <int> <int> <int> <int> <dbl>
## 1 Alaska 439 439 151876 86064 3890 241830 0.0832
## 2 Autauga 22 22 15212 4774 74 20060 0.00691
## 3 Baldwin 50 50 52910 15579 371 68860 0.0237
## 4 Barbour 24 24 5893 4826 26 10745 0.00370
## 5 Bibb 16 16 5471 2089 12 7572 0.00261
## 6 Blount 26 26 17364 3932 92 21388 0.00736
## 7 Bullock 27 27 1494 3210 3 4707 0.00162
## 8 Butler 27 27 4978 3409 13 8400 0.00289
## 9 Calhoun 53 53 29806 15076 182 45064 0.0155
## 10 Chambers 24 24 7618 5346 42 13006 0.00448
## # ℹ 4,590 more rows
insample <- sampling::UPtille(election$p)
ppsample <- election[insample == 1, ]
ppsample$wt <- 1 / ppsample$p
pps_design <- svydesign(id = ~1, weight = ~wt, data = ppsample)
pps_design
## Independent Sampling design (with replacement)
## svydesign(id = ~1, weight = ~wt, data = ppsample)
svytotal(~ Bush + Kerry + Nader, pps_design, deff = TRUE)
## total SE DEff
## Bush 54878056 2423791 0.0077
## Kerry 60850222 2414680 0.0040
## Nader 470827 79556 0.0821
The loss of precision per observation from cluster sampling is given by the design effect.
“For a single-stage cluster sample with all clusters having the same number of individuals, \(m\), the design effect is
\[ DEff = 1 + (m-1)\rho, \]
where \(\rho\) is the within-cluster correlation.”
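To see where this formula comes from: with \(n\) clusters of \(m\) individuals each, common variance \(\sigma^2\) and within-cluster correlation \(\rho\), the variance of the overall mean is

\[
V[\bar{x}_{\text{clus}}] = \frac{1}{n} V\!\left[\frac{1}{m}\sum_{j=1}^{m} x_{j}\right] = \frac{\sigma^{2}}{nm}\left(1 + (m-1)\rho\right),
\]

while a simple random sample of the same total size \(nm\) has variance \(\sigma^{2}/(nm)\); the design effect is the ratio of the two.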
Lumley shows how design effects quantify the impact of the design on inference, using the California school data set from before as well as the Behavioral Risk Factor Surveillance System from 2007.
svymean(~ api00 + meals + ell + enroll, clus1_design, deff = TRUE)
## mean SE DEff
## api00 644.1694 23.5422 9.2531
## meals 50.5355 6.2690 10.3479
## ell 27.6120 2.0193 2.6711
## enroll 549.7158 45.1914 2.7949
In the above, the variance is up to 10 times higher in the cluster sample as compared to a simple random sample.
## Lumley renames clus2_design to dclus2 from before. I maintain the same names.
svymean(~ api00 + meals + ell + enroll, clus2_design, deff = TRUE, na.rm = TRUE)
## mean SE DEff
## api00 673.0943 31.0574 6.2833
## meals 52.1547 10.8368 11.8585
## ell 26.0128 5.9533 9.4751
## enroll 526.2626 80.3410 6.1427
These values increase for all measures except api00 in the
two stage cluster sampling design. Lumley points out that these large
design effects demonstrate how variable the measures of interest are
between clusters, suggesting that sampling clusters, while economically
efficient, is not as statistically efficient.
Similarly, when computing the proportion of individuals who have more than 5 servings of fruits and vegetables a day (X_FV5SRV = 2), as well as how often individuals received a cholesterol test in the past 5 years (X_CHOLCHK = 1) from the 2007 Behavioral Risk Factor Surveillance System dataset, we see design effects that reflect the geographic variability across the blocks of telephone numbers that were sampled for the survey.
brfss <- svydesign(
id = ~X_PSU, strata = ~X_STATE, weight = ~X_FINALWT,
data = "brfss", dbtype = "SQLite",
dbname = "data/BRFSS/brfss07.db", nest = TRUE
)
brfss
## DB-backed Stratified Independent Sampling design (with replacement)
## svydesign(id = ~X_PSU, strata = ~X_STATE, weight = ~X_FINALWT,
## data = "brfss", dbtype = "SQLite", dbname = "data/BRFSS/brfss07.db",
## nest = TRUE)
food_labels <- c("Yes Veg", "No Veg")
chol_labels <- c("within 5 years", ">5 years", "never")
svymean(
~ factor(X_FV5SRV) +
factor(X_CHOLCHK),
brfss,
deff = TRUE
)
## mean SE DEff
## factor(X_FV5SRV)1 0.73096960 0.00153359 5.1632
## factor(X_FV5SRV)2 0.23844253 0.00145234 5.0147
## factor(X_FV5SRV)9 0.03058787 0.00069991 7.1323
## factor(X_CHOLCHK)1 0.73870300 0.00168562 6.3550
## factor(X_CHOLCHK)2 0.03230828 0.00058759 4.7676
## factor(X_CHOLCHK)3 0.19989559 0.00162088 7.0918
## factor(X_CHOLCHK)9 0.02909313 0.00055471 4.7029
Lumley notes that design based inference continues to differ from model based inference in its treatment of repeated measurements. Where model based inference must explicitly model, for example, the within-person or within-household correlation in a cohort study, no such adjustment is required in a designed survey beyond using the appropriate weights and treating the repeated measurement as another stage of clustering in the sampling.
Lumley illustrates this with the Survey of Income and Program Participation (SIPP) panel survey.
Each panel is followed for multiple years, with subsets of the panel participating in four month waves of follow-up… wave 1 of the 1996 panel, which followed 36,730 households with interviews every four months, starting in late 1995 or early 1996… The households were recruited in a two-stage sample. The first stage sampled 322 counties or groups of counties as PSUs; the second stage sampled households within these PSUs.
Lumley demonstrates how to estimate repeated measures with panel data
using the survey package via the code below. Five quantiles
are estimated across the population for each month. When Lumley says
above that there is no need to adjust for correlation, I believe he is
referring to the within-month point estimates. If we were to try to
estimate the change in, say, income as a function of other covariates, I
believe we would want to account for the correlation in order to get
appropriate standard errors. For the point estimates Lumley points out
that the weights are exactly as required for the per-month estimate, but
would need to be divided by the number of measurements when totaling
across them. Proportions and regressions are invariant to this scaling
factor, so no adjustment is needed there.
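A hypothetical sketch of that weight adjustment (the data frame `household_df` and the column name `wt_pooled` are illustrative, not from Lumley's code):

```r
k <- 4  # number of monthly measurements being pooled
# household_df is assumed to hold the SIPP household records
household_df$wt_pooled <- household_df$whfnwgt / k
sipp_pooled <- svydesign(
  id = ~ghlfsam, strata = ~gvarstr, nest = TRUE,
  weight = ~wt_pooled, data = household_df
)
# Totals from sipp_pooled count each household once across the k months;
# means, proportions and regression coefficients are unchanged because
# the 1/k factor cancels in the numerator and denominator.
```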
sipp_hh <- svydesign(
id = ~ghlfsam, strata = ~gvarstr, nest = TRUE,
weight = ~whfnwgt, data = "household", dbtype = "SQLite",
dbname = "Data/SIPP/sipp.db"
)
sipp_hh <- update(sipp_hh,
month = factor(rhcalmn,
levels = c(12, 1, 2, 3, 4, 5, 6),
labels = c(
"Dec", "Jan", "Feb", "Mar",
"Apr", "May", "Jun"
)
)
)
qinc <- svyby(~thtotinc, ~month, svyquantile,
design = sipp_hh,
quantiles = c(0.25, 0.5, 0.75, 0.9, 0.95), se = TRUE
)
pltdf <- as_tibble(qinc) %>%
select(month, contains("thtotinc"), -contains("se")) %>%
gather(everything(), -month, key = "quantile", value = "Total Income") %>%
mutate(quantile = as.numeric(str_extract(quantile, "[0-9].[0-9]?[0-9]")) * 100)
se <- as_tibble(qinc) %>%
select(month, contains("se")) %>%
gather(everything(), -month, key = "quantile", value = "SE") %>%
mutate(quantile = as.numeric(str_extract(quantile, "[0-9].[0-9]?[0-9]")) * 100)
pltdf <- pltdf %>%
left_join(se) %>%
mutate(
lower = `Total Income` - 2 * SE,
upper = `Total Income` + 2 * SE
)
pltdf %>%
mutate(quantile = factor(quantile)) %>%
ggplot(aes(x = month, y = `Total Income`, color = quantile)) +
geom_pointrange(aes(ymin = lower, ymax = upper)) +
xlab("Month in 1995/1996") +
ylab("Total Income (USD)") +
ggtitle("Total Income Quantiles",
subtitle = "Survey of Income and Program Participation"
)
Lumley notes that we're only talking about correlation at all because the same individuals are being measured across time. In a study like NHANES that recruits different individuals each year, we can effectively assume the samples are independent since the overall population is so big.
Estimation would be much more complicated if the samples overlapped in some way.
## data are from the book website:
# https://r-survey.r-forge.r-project.org/svybook/
# demographic data
ddf <- haven::read_xpt("data/nhanesxpt/demo_c.xpt")
# blood pressure data
bpdf <- haven::read_xpt("data/nhanesxpt/bpx_c.xpt")
bpdf <- merge(ddf, bpdf, by = "SEQN", all = FALSE) %>%
mutate(
sys_over_140 = (BPXSAR > 140) * 1,
dia_over_90 = (BPXDAR > 90) * 1,
hbp = (sys_over_140 + dia_over_90 == 2) * 1,
age_group = cut(RIDAGEYR, c(0, 21, 65, Inf)),
sex = factor(RIAGENDR, levels = 1:2, labels = c("Male", "Female"))
)
bpdf_design <- bpdf %>%
select(
sys_over_140, dia_over_90, hbp, WTMEC2YR, SDMVPSU, SDMVSTRA,
age_group, sex, RIDAGEYR, BPXSAR, BPXDAR, RIAGENDR, RIDAGEMN
) %>%
as_survey_design(
weights = WTMEC2YR,
id = SDMVPSU,
strata = SDMVSTRA,
nest = TRUE
)
bpdf_design
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (30) clusters.
## Called via srvyr
## Sampling variables:
## - ids: SDMVPSU
## - strata: SDMVSTRA
## - weights: WTMEC2YR
## Data variables:
## - sys_over_140 (dbl), dia_over_90 (dbl), hbp (dbl), WTMEC2YR (dbl), SDMVPSU
## (dbl), SDMVSTRA (dbl), age_group (fct), sex (fct), RIDAGEYR (dbl), BPXSAR
## (dbl), BPXDAR (dbl), RIAGENDR (dbl), RIDAGEMN (dbl)
Lumley doesn’t tell us which variable name corresponds to the measurements indicated and I didn’t see any documentation in the files included so I just guess here.
bpdf_design %>%
summarize(
prop_hbp_one =
survey_ratio(sys_over_140, n(), na.rm = TRUE, vartype = "ci", deff = TRUE),
prop_hbp_two =
survey_ratio(sys_over_140, n(), na.rm = TRUE, vartype = "ci", deff = TRUE),
prob_hbp = survey_ratio(hbp, n(), na.rm = TRUE, vartype = "ci", deff = TRUE)
) %>%
gather(everything(), key = "metric", value = "value")
## # A tibble: 12 × 2
## metric value
## <chr> <dbl>
## 1 prop_hbp_one 0.0000119
## 2 prop_hbp_one_low 0.0000104
## 3 prop_hbp_one_upp 0.0000135
## 4 prop_hbp_one_deff 3.60
## 5 prop_hbp_two 0.0000119
## 6 prop_hbp_two_low 0.0000104
## 7 prop_hbp_two_upp 0.0000135
## 8 prop_hbp_two_deff 3.60
## 9 prob_hbp 0.00000217
## 10 prob_hbp_low 0.00000141
## 11 prob_hbp_upp 0.00000293
## 12 prob_hbp_deff 4.15
The design effects are included in the output above. Next, the same quantities are estimated across age (RIDAGEYR) and gender (RIAGENDR) groups:
nhanes_ests <- bpdf_design %>%
group_by(sex, age_group) %>%
summarize(
prop_hbp_one =
survey_ratio(sys_over_140, n(), na.rm = TRUE, vartype = "ci", deff = TRUE),
prop_hbp_two =
survey_ratio(sys_over_140, n(), na.rm = TRUE, vartype = "ci", deff = TRUE),
prob_hbp = survey_ratio(hbp, n(), na.rm = TRUE, vartype = "ci", deff = TRUE)
) %>%
ungroup()
nhanes_ests
## # A tibble: 8 × 14
## sex age_group prop_hbp_one prop_hbp_one_low prop_hbp_one_upp
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 Male (0,21] 0.000000953 -0.000000551 0.00000246
## 2 Male (21,65] 0.0000647 0.0000486 0.0000809
## 3 Male (65,Inf] 0.000478 0.000402 0.000554
## 4 Male <NA> NaN NaN NaN
## 5 Female (0,21] 0.000000660 -0.000000521 0.00000184
## 6 Female (21,65] 0.0000597 0.0000518 0.0000677
## 7 Female (65,Inf] 0.000738 0.000655 0.000820
## 8 Female <NA> NaN NaN NaN
## # ℹ 9 more variables: prop_hbp_one_deff <dbl>, prop_hbp_two <dbl>,
## # prop_hbp_two_low <dbl>, prop_hbp_two_upp <dbl>, prop_hbp_two_deff <dbl>,
## # prob_hbp <dbl>, prob_hbp_low <dbl>, prob_hbp_upp <dbl>, prob_hbp_deff <dbl>
This produces a bunch of output, so I plot the high blood pressure result (both high systolic and diastolic blood pressure) below.
nhanes_ests %>%
select(age_group, sex, prob_hbp, prob_hbp_low, prob_hbp_upp) %>%
filter(!is.na(age_group)) %>%
ggplot(aes(x = age_group, y = prob_hbp, color = sex)) +
geom_pointrange(aes(ymin = prob_hbp_low, ymax = prob_hbp_upp),
position = position_dodge(width = .25)
) +
ylab("% US Population with High Blood Pressure") +
xlab("Age Group") +
ggtitle("Prevalence of High Blood Pressure in U.S. Across Age and Sex",
subtitle = "Data from NHANES 2003-2004"
) +
scale_y_continuous(labels = scales::percent)
3.2 Repeat the sampling and estimation in Figure 3.4 1000 times.
FYI: the sampling::UPtille function is slow, so running this
simulation 1000 times takes quite a while; to keep the runtime manageable
I use 10 replicates below.
OneSimulation <- function() {
insample <- sampling::UPtille(election$p)
ppsample <- election[insample == 1, ]
ppsample$wt <- 1 / ppsample$p
pps_design <- svydesign(id = ~1, weight = ~wt, data = ppsample, pps = "brewer")
total <- svytotal(~ Bush + Kerry + Nader, pps_design)
std_err <- sqrt(diag(attr(total, "var")))
ci <- confint(total)
out <- cbind(total, std_err, ci)
}
total_sims <- replicate(10, OneSimulation())
estimated_totals <- as_tibble(t(rowMeans(total_sims[, 1, ]))) %>%
mutate(label = "Estimate")
true_totals <- as_tibble(t(colSums(election[, c("Bush", "Kerry", "Nader")]))) %>%
mutate(label = "Truth")
rbind(estimated_totals, true_totals) %>%
gather(everything(), -label, key = "Candidate", value = "Vote Count") %>%
spread(label, `Vote Count`) %>%
gt::gt() %>%
gt::fmt_scientific() %>%
gt::tab_header("Simulated Total Comparison")
**Simulated Total Comparison**

| Candidate | Estimate | Truth |
|---|---|---|
| Bush | 6.02 × 10^7 | 5.96 × 10^7 |
| Kerry | 5.56 × 10^7 | 5.61 × 10^7 |
| Nader | 4.14 × 10^5 | 4.04 × 10^5 |
These are quite close.
b). Compute the mean of the estimated standard errors and compare it to the true simulation standard error, that is, the standard deviation of the estimated totals.
estimated_stderrs <- as_tibble(t(rowMeans(total_sims[, 2, ]))) %>%
mutate(label = "Estimate")
true_stderr <- as_tibble(apply(total_sims[, 1, ], 1, sd)) %>%
mutate(label = "Truth", Candidate = c("Bush", "Kerry", "Nader")) %>%
spread(Candidate, value)
rbind(estimated_stderrs, true_stderr) %>%
gather(everything(), -label, key = "Candidate", value = "Vote Std.Err") %>%
spread(label, `Vote Std.Err`) %>%
gt::gt() %>%
gt::fmt_scientific() %>%
gt::tab_header("Simulated Standard Error Comparison")
As above, the estimated and true standard errors are quite close.
totals <- unlist(true_totals[, 1:3])
prop_contained <- rowMeans((total_sims[, 3, ] < totals &
total_sims[, 4, ] > totals) * 1)
prop_contained
We see that the proportion contained by the interval is near the 95% nominal value.
3.3 The National Longitudinal Study of Youth is documented at www.nlsinfo.org/nlsy97/nlsdocs/nlsy97/maintoc.html
From the website the sampling occurred in 2 phases (see image below).
The first phase screened households and the second identified eligible respondents. I think this would technically be called a two phase sample (discussed in chapter 8) but given the population sizes involved it may be that there was no need to account for the potential dependence here.
The sampling unit in the first stage came from NORC’s 1990 sampling design which used Standard Metropolitan Statistical Areas or non-metropolitan counties. That is, according to the general social survey documentation
Subsequent sampling units are segments, households, and then members of households.
I don’t know. It isn’t clear how to reconstruct Lumley’s example given that the data is not available. Looking at the NHIS website it looks like the strata and PSU information for a single stage approximation are given. When looking for the same information for the NLYS data I don’t see equivalent instructions on how to construct a single stage approximation. His description in the book here isn’t very forthcoming either but I’d guess that one could just concatenate the PSU and SSU labels and the same for the strata and then use those in a one stage design. This doesn’t look exactly like what Lumley does in his example.
3.4 The British Household Panel Survey (BHPS) is documented at http://www.iser.essex.ac.uk/survey/bhps (this website is no longer accurate) .
The website given is no longer accurate. Clicking through the documentation of the “Understanding Society” website, it looks like the BHPS has been combined with other surveys. Looking at more documentation (1, 2, 3), the last of these being the most pertinent, I see a variety of designs by wave. It looks like the design was a two stage stratified sample whose sampling frame was the Postcode Address File for Great Britain, excluding Northern Ireland. 250 postcodes were chosen as the primary sampling units; in the second stage, “delivery points”, which are roughly equivalent to addresses, were then selected.
Again, my best guess is to concatenate the postcodes and delivery addresses to construct the single stage identifiers, and the same for the strata.
3.5 Statistics New Zealand lists survey topics at http://www.stats.govt.nz/datasets/a-z-list.htm . Find the Household Labour Force Survey.
Again, that site is no longer valid. The current help page for the household labor survey simply says,
Approximately fifteen thousand (15,000) households take part in this survey every three months. A house is selected using a fair statistical method to ensure the sample is an accurate representation of New Zealand. Every person aged 15 years or over living in a selected house needs to take part.
A different document I found via a Google search listed the 2016 survey design as a two stage design where samples are drawn from strata in the first stage. The PSUs are “meshblocks”, which appear to be the NZ equivalent of a U.S. census block or tract.
Same answer here as previously.
3.6 This exercise uses the Washington State Crime data for 2004 as the population. The data consists of crime rates and population size for the police districts (in cities/towns) and sheriffs’ offices (in unincorporated areas), grouped by county.
I took the data from this website, specifically this excel sheet. One tricky aspect of these data is that several police districts are reported as having zero associated population; I removed these from the dataset to make things simpler.
# data from
wa_crime_df <- readxl::read_xlsx("data/WA_crime/1984-2011.xlsx", skip = 4) %>%
filter(Year == "2004", Population > 0) %>%
mutate(
murder_and_crime = `Murder Total` + `Burglary Total`,
violent_crime = `Violent Crime Total`,
property_crime = `Property Crime Total`,
state_pop = sum(Population),
County = stringr::str_to_lower(County),
num_counties = n_distinct(County),
) %>%
group_by(County) %>%
mutate(num_agencies = n_distinct(Agency)) %>%
ungroup() %>%
select(
County, Agency, Population, murder_and_crime,
violent_crime, property_crime,
num_counties, num_agencies
)
true_count <- sum(wa_crime_df$murder_and_crime)
county_list <- unique(wa_crime_df$County)
county_sample <- sample(county_list, 10)
part_a <- wa_crime_df %>%
filter(County %in% county_sample) %>%
as_survey_design(
ids = c(County, Agency),
fpc = c(num_counties, num_agencies)
) %>%
summarize(total = survey_total(murder_and_crime)) %>%
mutate(Q = "a")
part_a
## # A tibble: 1 × 3
## total total_se Q
## <dbl> <dbl> <chr>
## 1 60407. 26174. a
county_sample <- sample(county_list[-which(county_list == "king")], 5)
county_sample <- c(county_sample, "king")
part_b <- wa_crime_df %>%
filter(County %in% county_sample) %>%
mutate(
strata_label = if_else(County == "king", "strata 1", "strata 2"),
num_counties = if_else(County == "king", 1, length(county_list) - 1)
) %>%
as_survey_design(
ids = c(County, Agency),
fpc = c(num_counties, num_agencies),
strata = strata_label
) %>%
summarize(
total = survey_total(murder_and_crime)
) %>%
mutate(Q = "b")
part_b
## # A tibble: 1 × 3
## total total_se Q
## <dbl> <dbl> <chr>
## 1 69768. 32535. b
This is a stratified two-stage sample design with no uncertainty in the first stage in one stratum (King County), and no uncertainty in the second stage in the other stratum (non-King counties).
king_districts <- wa_crime_df %>%
filter(County == "king") %>%
pull(Agency)
sampled_king_districts <- sample(king_districts, 5)
sampled_counties <- sample(county_list, 5)
part_c <- wa_crime_df %>%
filter(County %in% sampled_counties | Agency %in% sampled_king_districts) %>%
mutate(
strata_label = if_else(County == "king", "King County", "WA Counties"),
num_counties = if_else(County == "king", 1, length(county_list) - 1),
) %>%
as_survey_design(
id = c(County, Agency),
fpc = c(num_counties, num_agencies),
strata = strata_label
) %>%
summarize(total = survey_total(murder_and_crime, vartype = "se")) %>%
mutate(Q = "c")
part_c
## # A tibble: 1 × 3
## total total_se Q
## <dbl> <dbl> <chr>
## 1 26544 8515. c
pi <- sampling::inclusionprobabilities(a = wa_crime_df$Population, n = 10)
part_d <- wa_crime_df %>%
mutate(
pi = pi,
in_sample = sampling::UPbrewer(pi)
) %>%
filter(in_sample == 1) %>%
as_survey_design(probs = pi) %>%
summarize(total = survey_total(murder_and_crime)) %>%
mutate(Q = "d")
part_d
## # A tibble: 1 × 3
## total total_se Q
## <dbl> <dbl> <chr>
## 1 62561. 9372. d
This has the lowest standard error yet, likely since the sampling was specifically designed to capture high density districts.
county_sample <- sample(county_list, 10)
part_e <- wa_crime_df %>%
filter(County %in% county_sample) %>%
as_survey_design(
ids = c(County, Agency),
fpc = c(num_counties, num_agencies)
) %>%
summarize(total = survey_total(murder_and_crime)) %>%
mutate(Q = "e")
part_e
## # A tibble: 1 × 3
## total total_se Q
## <dbl> <dbl> <chr>
## 1 31231. 10580. e
I’ll compare the standard error of all the estimates now.
rbind(part_a, part_b, part_c, part_d, part_e) %>%
select(Q, total_se) %>%
gt::gt()
| Q | total_se |
|---|---|
| a | 26174.093 |
| b | 32534.860 |
| c | 8515.277 |
| d | 9372.016 |
| e | 10579.989 |
We see the largest standard errors from the simple random samples of counties (parts a and e) and from the stratified sample in part b, and the smallest from part c (stratifying on King County) and part d (sampling proportional to size), which makes sense: the designs that use population information to target the high-crime districts pay off.
3.7 Use the household data from the 1996 SIPP panel to estimate the
25th, 50th, 75th, 90th and 95th percentiles of income for households of
different sizes (ehhnumpp) averaged over the four months.
You will want to recode the large values of ehhnumpp to a
single category. Describe the patterns you see.
For this question we effectively copy the code from the “Repeated Measures” section, looking at quantile by a recoded household size now, instead of month.
sipp_hh <- update(sipp_hh,
household_size = factor(
case_when(
ehhnumpp <= 8 ~ as.character(ehhnumpp),
TRUE ~ ">=9"
),
levels = c(as.character(1:8), ">=9")
)
)
qinc <- svyby(~thtotinc, ~household_size, svyquantile,
design = sipp_hh,
quantiles = c(0.25, 0.5, 0.75, 0.9, 0.95), se = TRUE
)
pltdf <- as_tibble(qinc) %>%
select(household_size, contains("thtotinc"), -contains("se.")) %>%
gather(everything(), -household_size, key = "quantile", value = "Total Income") %>%
mutate(quantile = as.numeric(str_extract(quantile, "[0-9].[0-9]?[0-9]")) * 100)
se <- as_tibble(qinc) %>%
select(household_size, contains("se.")) %>%
gather(everything(), -household_size, key = "quantile", value = "SE") %>%
mutate(quantile = as.numeric(str_extract(quantile, "[0-9].[0-9]?[0-9]")) * 100)
pltdf <- pltdf %>%
left_join(se) %>%
mutate(
lower = `Total Income` - 2 * SE,
upper = `Total Income` + 2 * SE
)
pltdf %>%
mutate(quantile = factor(quantile)) %>%
ggplot(aes(x = household_size, y = `Total Income`, color = quantile)) +
geom_pointrange(aes(ymin = lower, ymax = upper)) +
xlab("Household Size") +
ylab("Total Income (USD)") +
ggtitle("Total Income Quantiles",
subtitle = "Survey of Income and Program Participation"
)
In the plot above we can see that income variability widens as household size increases, then plateaus at around 5. Additionally, there are few samples with very large households, so estimating the quantiles for those groups is increasingly noisy.
3.8 In the data from the 1996 SIPP panel a) What proportion of
households received any “means-tested cash benefits”
(thtrninc)? For those households that did receive benefits
what mean proportion of their total income came from these benefits?
db <- DBI::dbConnect(RSQLite::SQLite(), "Data/SIPP/sipp.db")
sipp_df <- tbl(db, sql("SELECT * FROM household")) %>%
dplyr::collect() %>%
select(ghlfsam, gvarstr, whfnwgt, thtotinc, thtrninc, tmthrnt) %>%
mutate(
benefit_recipient = I(thtrninc > 0) * 1,
thtrninc = as.numeric(thtrninc),
tmthrnt = as.numeric(tmthrnt)
)
sipp_hh_sub <- sipp_df %>%
as_survey_design(
id = ghlfsam, strata = gvarstr, nest = TRUE,
weight = whfnwgt
)
sipp_hh_sub %>%
summarize(prop_benefit_recipients = survey_mean(benefit_recipient))
## # A tibble: 1 × 2
## prop_benefit_recipients prop_benefit_recipients_se
## <dbl> <dbl>
## 1 0.0829 0.00174
sipp_hh_sub %>%
filter(benefit_recipient == 1) %>%
mutate(prop_benefit_income = thtrninc / thtotinc) %>%
summarize(
mn_prop_benefit_income = survey_mean(prop_benefit_income)
)
## # A tibble: 1 × 2
## mn_prop_benefit_income mn_prop_benefit_income_se
## <dbl> <dbl>
## 1 0.497 0.00710
b) What proportion of households paid rent (tmthrnt)? What
were the mean and the 75th and 95th percentiles of the proportion of
monthly income paid in rent? What proportion paid more than one third of
their income in rent?
sipp_hh_sub %>%
mutate(
pays_rent = I(as.numeric(tmthrnt) > 0) * 1,
pct_income_rent = (tmthrnt / thtotinc) * 100,
rent_pct_gt_thrd = (pct_income_rent > 100 / 3) * 1 # pct_income_rent is in percent
) %>%
# avoid divide by zero error
filter(thtotinc > 0) %>%
summarize(
prop_pays_rent = survey_mean(pays_rent),
mean_pct_income_rent = survey_mean(pct_income_rent, na.rm = TRUE),
quantile_pct_income_rent = survey_quantile(pct_income_rent,
na.rm = TRUE,
quantiles = c(.75, .95)
),
prop_rent_gt_thrd = survey_mean(rent_pct_gt_thrd, na.rm = TRUE)
) %>%
pivot_longer(cols = everything(), names_to = "Rent", values_to = "Estimates")
## # A tibble: 10 × 2
## Rent Estimates
## <chr> <dbl>
## 1 prop_pays_rent 0.0401
## 2 prop_pays_rent_se 0.00111
## 3 mean_pct_income_rent 2.24
## 4 mean_pct_income_rent_se 0.633
## 5 quantile_pct_income_rent_q75 0
## 6 quantile_pct_income_rent_q95 0
## 7 quantile_pct_income_rent_q75_se 1.42
## 8 quantile_pct_income_rent_q95_se 1.42
## 9 prop_rent_gt_thrd 0.0400
## 10 prop_rent_gt_thrd_se 0.00111
3.9 By the time you read this, the survey package is
likely to provide some approximations for PPS designs that use the
finite population size information. Repeat 3.2 using these.
OneSimulation <- function() {
insample <- sampling::UPtille(election$p)
ppsample <- election[insample == 1, ]
ppsample$wt <- 1 / ppsample$p
ppsample$fpc <- 40 / sum(election$votes)
pps_design <- svydesign(id = ~1, weight = ~wt, fpc = ~fpc, data = ppsample, pps = "brewer")
total <- svytotal(~ Bush + Kerry + Nader, pps_design)
std_err <- sqrt(diag(attr(total, "var")))
ci <- confint(total)
out <- cbind(total, std_err, ci)
}
total_sims <- replicate(10, OneSimulation())
estimated_totals <- as_tibble(t(rowMeans(total_sims[, 1, ]))) %>%
mutate(label = "Estimate")
true_totals <- as_tibble(t(colSums(election[, c("Bush", "Kerry", "Nader")]))) %>%
mutate(label = "Truth")
rbind(estimated_totals, true_totals) %>%
gather(everything(), -label, key = "Candidate", value = "Vote Count") %>%
spread(label, `Vote Count`) %>%
gt::gt() %>%
gt::fmt_scientific() %>%
gt::tab_header("Simulated Total Comparison")
**Simulated Total Comparison**

| Candidate | Estimate | Truth |
|---|---|---|
| Bush | 5.96 × 10^7 | 5.96 × 10^7 |
| Kerry | 5.62 × 10^7 | 5.61 × 10^7 |
| Nader | 3.85 × 10^5 | 4.04 × 10^5 |
These are quite close.
b). Compute the mean of the estimated standard errors and compare it to the true simulation standard error, that is, the standard deviation of the estimated totals.
estimated_stderrs <- as_tibble(t(rowMeans(total_sims[, 2, ]))) %>%
mutate(label = "Estimate")
true_stderr <- as_tibble(apply(total_sims[, 1, ], 1, sd)) %>%
mutate(label = "Truth", Candidate = c("Bush", "Kerry", "Nader")) %>%
spread(Candidate, value)
rbind(estimated_stderrs, true_stderr) %>%
gather(everything(), -label, key = "Candidate", value = "Vote Std.Err") %>%
spread(label, `Vote Std.Err`) %>%
gt::gt() %>%
gt::fmt_scientific() %>%
gt::tab_header("Simulated Standard Error Comparison")
Estimate 95% confidence intervals for the population totals and compute the proportion of intervals that contain the true population value.
totals <- unlist(true_totals[, 1:3])
prop_contained <- rowMeans((total_sims[, 3, ] < totals &
total_sims[, 4, ] > totals) * 1)
prop_contained
3.10 Repeat 3.2 computing the Hartley-Rao approximation and the full Horvitz-Thompson estimate using the code and joint-probability data on the web site. Compare the standard errors from these two approximations to the standard errors from the single-stage with replacement approximation and to the true simulation standard error.
3.11 Since 1999, NHANES has been conducting surveys continuously with data released on a 2 year cycle. Each data set includes a weight variable for analyzing the two-year data; the weights add to the size of the US adult, civilian, non-institutionalized population.
What weight would be appropriate if three two-year data sets were used?
What weights would be appropriate when estimating changes, comparing the combined 1999-2000 and 2001-2002 data with the combined 2003-2004 and 2005-2006 data?
How would the answers differ if the goal was to estimate a population proportion rather than a total?
Lumley advocates for three principles in visualizing survey data:
Base the graph on an estimated population distribution.
Explicitly indicate weights on the graph.
Draw a simple random sample from the estimated population distribution and graph this sample instead.
All three of these strategies are meant to counteract the difficulty in visualizing survey data — the data available do not represent the population of interest without re-weighting.
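Principle 3 can be sketched in base R: resampling rows with probability proportional to the sampling weights approximates a simple random sample from the estimated population. A sketch, assuming a data frame `df` with a weight column `pw` (as in the California API data):

```r
m <- 500  # size of the pseudo simple random sample
# Resample rows with probability proportional to their sampling weight.
idx <- sample(nrow(df), size = m, replace = TRUE, prob = df$pw)
srs_like <- df[idx, ]
# srs_like approximates a simple random sample from the estimated
# population and can be plotted with ordinary, unweighted graphics.
```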
While these principles are great and worth following, I found Lumley's demonstrations in this chapter a little out of date. Thanks to the ggplot family of libraries and shiny, there have been huge advances in visualizing and interacting with data of all types, surveys included. What's more, it isn't always clear how he produces the charts in the chapter. Consequently, I've done my best to reproduce the same charts, or a chart I'd consider appropriate for the question and survey at hand.
TODO: Talk about ggsurvey.
Lumley recommends using bar charts, forest plots and fourfold plots to visualize the data from a table. I would agree that all of these are “fine” in a certain context, but personally I strongly prefer the forest plot and variations on it, since it does more than the others to represent the uncertainty about the tabulated estimates.
medians %>%
ggplot(aes(x = racehpr, y = Median_BMI, fill = srsex)) +
geom_bar(stat = "identity", position = "dodge") +
ylab("Median BMI (kg/m^2)") +
xlab("") +
theme(
legend.title = element_blank(),
axis.text.x = element_text(angle = 45, vjust = 0.25, hjust = .25)
) +
ggtitle("Median BMI Across Race, Sex")
medians %>%
ggplot(aes(x = Median_BMI, y = racehpr, color = srsex)) +
geom_pointrange(aes(xmin = Median_BMI_low, xmax = Median_BMI_upp),
position = position_dodge(width = .25)
) +
geom_vline(aes(xintercept = 25), linetype = 2, color = "red") +
xlab("BMI") +
ylab("") +
theme(legend.title = element_blank(), legend.position = "top") +
ggtitle("Median BMI by Race/Ethnicity and gender, from CHIS") +
labs(caption = str_c(
"Line indicates CDC border between healthy and ",
"overweight BMI."
))
Lumley argues that the most straightforward graphical displays for a single continuous variable are boxplots, cumulative distribution function (cdf) plots and survival curves. I think histograms are also useful but we’ll get to that shortly.
Below I reproduce Figure 4.6 from the text, in which Lumley demonstrates the difference between the weighted and unweighted CDFs of the California school data, showing that the design-based estimate is indeed closer to the population value than the naive, unweighted estimate from the stratified sample. This should all make intuitive sense. My only complaint is that the uncertainty of the CDF isn't visualized, when that should be just as easily accessible. I made a cursory investigation to see if these data were available in the svycdf object but they were not.
cdf.est <- svycdf(~ enroll + api00 + api99, dstrata)
cdf.pop <- ecdf(apipop$enroll)
cdf.samp <- ecdf(apistrat$enroll)
par(mar = c(4.1, 4.1, 1.1, 2.1))
plot(cdf.pop,
do.points = FALSE, xlab = "enrollment",
ylab = "Cumulative probability",
main = "Cumulative Distribution of California School Size",
sub = "Reproduces Lumley Fig 4.6", lwd = 1
)
lines(cdf.samp, do.points = FALSE, lwd = 2)
lines(cdf.est[[1]],
lwd = 2, col.vert = "grey60", col.hor = "grey60",
do.points = FALSE
)
legend("bottomright",
lwd = c(1, 2, 2), bty = "n",
col = c("black", "grey60", "black"),
legend = c("Population", "Weighted estimate", "Unweighted Estimate")
)
Next Lumley plots an adjusted Kaplan-Meier curve using svykm() on the National Wilms Tumor Study data. Unfortunately he doesn't show how to create the design object, and I find no mention of it elsewhere in the book. I made a guess and it seems to be accurate. Note that I obtained the National Wilms Tumor Study data from the survival package.
dcchs <- svydesign(ids = ~1, data = nwtco)
scurves <- svykm(Surv(edrel / 365.25, rel) ~ histol, dcchs)
plot(scurves)
The interpretation of this graph is that relapse (into cancer) occurs within about three years, if at all. The top curve shows the survival for those with a favorable histology classification, whose survival rate is correspondingly better.
Lumley makes a note here describing how the sampling weights make a large difference in the estimate because the design is a case cohort sample. That being said, I don’t really see the difference when I compare his figure to mine created with an equal probability sample… An error here perhaps.
Lumley describes the typical boxplots as
based on the quartiles of the data: the box shows the median and first and third quartiles, the whiskers extend out to the last observation within 1.5 interquartile ranges of the box ends, and all points beyond the whiskers are shown individually.
This is as good a summary as any; I'll only note that the median is the line in the interior of the box and the first and third quartiles define its boundaries. It's (somewhat) easy to imagine how Lumley estimates these values, as he has already introduced the svyquantile() function — this should be a straightforward application of that function.
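To make the connection to svyquantile() concrete, here is a minimal hand-rolled weighted quantile based on inverting the weighted ECDF, using simulated data; svyquantile() additionally handles ties, design-based confidence intervals, and several quantile definitions, so this is a sketch of the idea rather than a replacement.

```r
# Smallest x whose weighted ECDF reaches p
weighted_quantile <- function(x, w, p) {
  o <- order(x)
  x <- x[o]
  w <- w[o]
  cw <- cumsum(w) / sum(w)   # weighted ECDF at the sorted points
  x[which(cw >= p)[1]]
}

set.seed(7)
x <- rexp(1000)
w <- runif(1000, 1, 5)       # pretend sampling weights

# The box ends and median of a design-based boxplot
sapply(c(.25, .5, .75), function(p) weighted_quantile(x, w, p))
```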
I reproduce Figure 4.10 from the text using the nhanes data and design object created in answering the questions for chapter 3. His code is equivalent.
## Lumley's code
# nhanes <- svydesign(id = ~SDMVPSU, strat = ~SDMVSTRA,
# weights = ~WTMEC2YR, data = both, nest = TRUE)
## I use the bpdf design from the chapter 3 questions
nhanes <- bpdf_design %>%
mutate(
age_group = cut(RIDAGEYR, c(0, 20, 30, 40, 50, 60, 80, Inf)),
)
svyboxplot(BPXSAR ~ age_group, subset(nhanes, BPXSAR > 0),
col = "gray80",
varwidth = TRUE, ylab = "Systolic BP (mmHg)", xlab = "Age"
)
The plot shows the distribution of blood pressure across the different age groups. I already visualized this previously, but it is nice to know how to do something multiple ways. In general, we observe an increase in the median and third quartile of systolic blood pressure across age.
Histograms and kernel density estimators both visualize the probability density function (pdf) of a given dataset. Confusingly, perhaps, there is no pdf to estimate in finite population inference because it is the sampling that is random, not the data. Still, the analogous effort can be made, where a proportion is estimated for some number of bins of the variable of interest. Instead of a naive proportion, each bin now gets a design-weighted proportion. I reproduce Lumley's Figure 4.12 below demonstrating how this works.
svyhist(~BPXSAR, subset(nhanes, RIDAGEYR > 20 & BPXSAR > 0),
main = "",
col = "grey80", xlab = "Systolic BP (mmHg)"
)
lines(svysmooth(~BPXSAR, subset(nhanes, RIDAGEYR > 20 & BPXSAR > 0),
bandwidth = 5
), lwd = 2)
This looks considerably different from the naive estimate, again showcasing the need to take the design into account.
hist(bpdf_design$variables$BPXSAR,
xlab = "Systolic BP (mmHg)",
main = "Naive Distribution of Blood Pressure"
)
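The weighted bin-proportion idea described above can be done by hand in base R: each bin's height is the sum of the sampling weights of the observations falling in the bin, divided by the total weight. This is, up to scaling, what svyhist() computes from the design object; the data and weights below are simulated.

```r
set.seed(3)
x <- rnorm(2000, 120, 15)    # pretend blood pressure measurements
w <- runif(2000, 0.5, 2)     # pretend sampling weights

breaks <- seq(60, 180, by = 10)
bin <- cut(x, breaks)

# Weighted proportion of the estimated population in each bin
weighted_prop <- tapply(w, bin, sum) / sum(w)
weighted_prop[is.na(weighted_prop)] <- 0   # empty bins get zero

round(weighted_prop, 3)
```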
With one variable, the design could be used to adjust the raw sample summaries; two-variable visualizations make this much more difficult, as there is no obvious way to incorporate the design information into the presented estimate. An additional dimension has to be used to illustrate the design information.
To that end, Lumley uses a bubble plot with the size of the "bubbles" proportional to the sampling weight. This way, the viewer can identify which points should be considered more or less impactful in terms of representing the population. I reproduce part of Figures 4.13 - 4.15 from the text below (those plots for which Lumley includes the code), plotting the systolic and diastolic blood pressure from the NHANES design.
# TODO(petersonadam): Try plotting the api data here too
par(mfrow = c(1, 2))
svyplot(BPXSAR ~ BPXDAR,
design = nhanes, style = "bubble",
xlab = "Diastolic pressure (mmHg)",
ylab = "Systolic pressure (mmHg)"
)
svyplot(BPXSAR ~ BPXDAR,
design = nhanes, style = "transparent",
pch = 19, alpha = c(0, 0.5),
xlab = "Diastolic pressure (mmHg)",
ylab = "Systolic pressure (mmHg)"
)
As Lumley notes, this is a case in which the sampling weights — bubble size — are not particularly informative, as we don't see any obvious pattern relating the size of the bubbles to the two variables of interest. Turning to the shading plot, though we can now see more than just the "blob" in the first plot, there's still no apparent pattern between the weights and the two variables.
Below, I include Lumley's "subsample" method, which draws a simple random sample from the target population by using the provided sampling probabilities. Lumley recommends looking at two or three iterations of a subsample plot in order to ensure that any features visualized are not "noise" from the sampling process.
svyplot(BPXSAR ~ BPXDAR,
design = nhanes, style = "subsample",
sample.size = 1000,
xlab = "Diastolic pressure (mmHg)",
ylab = "Systolic pressure (mmHg)"
)
#### Aggregation and smoothing
While less of an issue than it was at the time of Lumley's writing, aggregating and smoothing condenses the number of plotted points, reducing the memory required to visualize the full sample. In a design setting it also provides an opportunity to incorporate the design information into a visualization-specific estimate.
Lumley's first example of this is a hexplot - a plot in which a grid of hexagons is created and each hexagon is sized according to the estimated population count (the sum of the sampling weights) within the hex-bin.
svyplot(BPXSAR ~ BPXDAR,
design = nhanes, style = "hex", legend = 0,
xlab = "Diastolic pressure (mmHg)",
ylab = "Systolic pressure (mmHg)"
)
This is perhaps the best visualization of the data thus far: the "blob" is still apparent, outliers remain visible, and the summary of the data is more concise.
Lumley next describes how to take the discrete estimates examining blood pressure as a function of age and smooth them when both variables are continuous. The code below shows how to do this with the svysmooth() function, using quantile regression methods from the quantreg package. Mean estimates can also be obtained by using method = 'locpoly'.
adults <- subset(nhanes, !is.na(BPXSAR))
a25 <- svysmooth(BPXSAR ~ RIDAGEYR,
method = "quantreg", design = adults,
quantile = .25, df = 4
)
a50 <- svysmooth(BPXSAR ~ RIDAGEYR,
method = "quantreg", design = adults,
quantile = .5, df = 4
)
a75 <- svysmooth(BPXSAR ~ RIDAGEYR,
method = "quantreg", design = adults,
quantile = .75, df = 4
)
a10 <- svysmooth(BPXSAR ~ RIDAGEYR,
method = "quantreg", design = adults,
quantile = .1, df = 4
)
a90 <- svysmooth(BPXSAR ~ RIDAGEYR,
method = "quantreg", design = adults,
quantile = .9, df = 4
)
plot(BPXSAR ~ RIDAGEYR,
data = nhanes, type = "n", ylim = c(80, 200),
xlab = "Age", ylab = "Systolic Pressure"
)
lines(a50, lwd = 3)
lines(a25, lwd = 1)
lines(a75, lwd = 1)
lines(a10, lty = 3)
lines(a90, lty = 3)
legend("topleft",
legend = c("10%,90%", "25%, 75%", "median"),
lwd = c(1, 1, 3), lty = c(3, 1, 1), bty = "n"
)
A conditioning plot is effectively a scatter plot with a third variable fixed; this third variable is then displayed in the facet title of the plot. In the text Lumley shows how to do this using a call to the svycoplot() function to recreate the top half of Figure 4.20. Of note, the additional hexscale argument can be given the value "absolute" to make the scales comparable between panels. By default the scales are facet-specific, though the data (for a continuous facet) have a ~50% overlap between panels, so the scales are not dramatically different.
svycoplot(BPXSAR ~ BPXDAR | equal.count(RIDAGEMN),
style = "hex", # or "transparent" for shaded hexes,
# hexscale = "absolute" # for fixed scales across facets.
design = subset(nhanes, BPXDAR > 0), xbins = 20,
strip = strip.custom(var.name = "AGE"),
xlab = "Diastolic pressure (mm Hg)",
ylab = "Systolic pressure (mm Hg)"
)
The final section in this chapter looks at how to visualize survey data spatially across a map. Since many surveys contain geographic information in both their sampling design and the questions they seek to answer, it makes sense that one might want to visualize estimates at some geographic scale.
Lumley uses the maptools R package to take estimates
computed using techniques/functions already demonstrated and visualize
them on maps.
Since maptools isn’t available on CRAN at the current time of
writing, I’ll use the sf
package and ggplot2.
Below I reproduce Figure 4.21 from the text using these packages with the same design based estimates in the text. The data was procured from Lumley’s website — search for “BRFSS”.
states <- read_sf("Data/BRFSS2007/brfss_state_2007_download.shp") %>%
arrange(ST_FIPS)
bys <- svyby(~X_FRTSERV, ~X_STATE, svymean,
design = subset(brfss, X_FRTSERV < 99999)
)
state_fruit_servings <- states %>%
select(ST_FIPS) %>%
st_drop_geometry() %>%
left_join(bys, by = c("ST_FIPS" = "X_STATE")) %>%
mutate(geometry = states$geometry) %>%
st_as_sf()
state_fruit_servings %>%
ggplot() +
geom_sf(aes(fill = X_FRTSERV / 100)) +
theme_void() +
theme(legend.title = element_blank()) +
ggtitle("Servings of fruit per day, from BRFSS 2007")
hlth <- brfss %>%
as_survey_design() %>%
mutate(
agegp = cut(AGE, c(0, 35, 50, 65, Inf)),
state = X_STATE,
covered = (HLTHPLAN == 1) * 1
) %>%
group_by(agegp, state) %>%
summarize(
health_coverage = survey_mean(covered)
) %>%
## Formatting
mutate(Age = case_when(
agegp == "(0,35]" ~ "<35",
agegp == "(35,50]" ~ "35-50",
agegp == "(50,65]" ~ "50-65",
agegp == "(65,Inf]" ~ "65+"
))
insurance_coverage <- states %>%
select(ST_FIPS, geometry) %>%
left_join(hlth, by = c("ST_FIPS" = "state")) %>%
st_as_sf()
insurance_coverage %>%
ggplot(aes(fill = health_coverage)) +
geom_sf() +
facet_wrap(~Age) +
theme_void() +
theme(legend.title = element_blank())
Lumley then shows two plots with insurance coverage at the state level. Looking at the code he posted on the web, it isn't clear to me what he's estimating differently here, as I don't see any calls to the survey functions. Consequently, I've simulated some fake data and created a plot with the equivalent structure.
cities <- read_sf("Data/BRFSS2007/BRFSS_MMSA_2007.shp") %>%
filter(NAME != "Columbus") %>%
transmute(Insurance = cut(rbeta(n(), 1, 1), c(0, 0.25, .5, .75, 1)))
marginal_insurance <- brfss %>%
as_survey_design() %>%
mutate(
covered = (HLTHPLAN == 1) * 1,
state = X_STATE,
) %>%
group_by(state) %>%
summarize(
health_coverage = survey_mean(covered)
) %>%
ungroup()
map_data <- states %>%
select(ST_FIPS, geometry) %>%
left_join(marginal_insurance, by = c("ST_FIPS" = "state")) %>%
st_as_sf()
ggplot() +
geom_sf(data = map_data, aes(fill = health_coverage)) +
geom_sf(data = cities, aes(color = Insurance)) +
theme_void() +
theme(legend.title = element_blank()) +
ggtitle("Insurance Coverage - from BRFSS and Fake Data")
svyboxplot(bmi_p ~ racehpr, chis, main = "BMI By Race/Ethnicity")
svyboxplot(bmi_p ~ srsex, chis, main = "BMI By Sex")
Using the code in Figure 3.8, draw a barplot of the quantiles of income and compare it to the dotchart in Figure 3.9. What are some advantages and disadvantages of each display?
Use svysmooth() to draw a graph showing change in systolic and diastolic blood pressure over time in the NHANES 2003-2004 data. Can you see the change to isolated systolic hypertension in old age that is shown in Figure 4.5?
plot(svysmooth(BPXSAR ~ RIDAGEYR, nhanes))
With the data from the SIPP 1996 panel, draw the cumulative distribution function, the density function, a histogram and a boxplot of total household income. Compare these graphs for their usefulness in showing the distribution of income.
With the data from the SIPP 1996 panel draw a graph showing
amount of rent (tmthrent) and proportion of income paid to
rent. You will want to exclude some outlying households that report much
higher rent than income.
Using data from CHIS 2005 (see section 2.3.1) examine how body mass index varies with age as we did with blood pressure in this chapter.
plot(svysmooth(bmi_p ~ srage_p, chis),
xlab = "Age", ylab = "BMI",
main = "BMI vs. Age (CHIS data)"
)
We see a very distinctive inverted-U shape: middle-aged individuals (50 to 60 year olds) have the highest average BMI, while the young and old have lower BMIs.
The left-hand panel of Figure 3.3 shows an interesting two-lobed pattern. Can you find what makes these lobes?
Set up a survey design for the BRFSS 2007 data as in Figure 3.7.
BRFSS measured annual income (income2) in categories < $10k, $10-15k, $20-25k, $25-35k, $35-50k, $50-75k, > $75k, and race (orace): white, black, asian, native hawaiian/pacific islander, native american, other.
df argument to the code in Figure 4.19.

Lumley identifies two main uses for regression in analyzing complex surveys:
In contrast to model based inference which typically discusses linear regression in the context of distributional assumptions on the outcome variable, Lumley notes that his discussion of regression is still within the design-based philosophy and consequently no model based assumptions are needed in order to compute valid 95% confidence intervals. However, model choice still matters insofar as one’s goal is to precisely estimate a relationship or population mean or total.
In statistical parlance, Lumley is using GEE - or generalized estimating equations to fit models, which only require assumptions about the moments of the underlying data distribution, rather than the family of the distribution.
My take is that this approach also allows for more flexibility in how the variance of the model can be estimated. Since the variance associated with a complex design is a function of the design itself, GEE makes it easier to structure the estimating equations in such a way that the variance computation corresponds to the design. In fact, Lumley explicitly notes in section 5.2.1 that the heteroskedasticity-consistent / sandwich variance estimators which came out of GEE are used by the survey package, with special handling for combining variance within/across strata.
Ratio estimation comes up first in this chapter because it is important in estimating (1) a population mean/total, (2) a ratio directly or (3) a subpopulation estimate of a mean.
Lumley illustrates how to estimate ratios in design methods by using the api dataset and estimating the proportion of students who took the Academic Performance Index exam.
svyratio(~api.stu, ~enroll, strat_design)
## Ratio estimator: svyratio.survey.design2(~api.stu, ~enroll, strat_design)
## Ratios=
## enroll
## api.stu 0.8369569
## SEs=
## enroll
## api.stu 0.007757103
This estimate does a good job of recovering the true population ratio, 0.84, which we happen to have access to for this example.
It's worth noting here, as Lumley does, that ratio estimates are not unbiased but are classified as "approximately unbiased", since the bias decreases proportionally to \(\frac{1}{n}\) and is consequently eventually smaller than the standard error, which decreases proportionally to \(\frac{1}{\sqrt{n}}\).
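A quick simulation (my own construction, not from the text) illustrating the "approximately unbiased" claim: under simple random sampling the ratio estimator's bias shrinks like \(1/n\) while its standard error shrinks like \(1/\sqrt{n}\), so the bias is quickly dominated by the sampling variability.

```r
set.seed(10)
N <- 10000
xpop <- runif(N, 10, 100)
ypop <- 0.8 * xpop + rnorm(N, sd = 5)
R <- sum(ypop) / sum(xpop)          # true population ratio

# Empirical bias and SE of the ratio estimator at sample size n
sim <- function(n, reps = 2000) {
  est <- replicate(reps, {
    s <- sample(N, n)               # simple random sample
    sum(ypop[s]) / sum(xpop[s])
  })
  c(bias = mean(est) - R, se = sd(est))
}

res <- rbind(n30 = sim(30), n300 = sim(300))
round(res, 4)   # |bias| is well below the SE at both sample sizes
```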
In the case where individual - not aggregate - data are available, the ratio being estimated is simply a proportion. This is the same logic by which subpopulation estimates were calculated previously in [Chapter 2: Simple Random Sample] via the svyby() function. These estimates require special handling - though in the survey package they can all be calculated via svymean(), svyratio() and svycontrast(), which I show below using the same design object from Chapter 2.
It is worth noting, before doing so, that the special handling needed here follows from the fact that both the numerator and the denominator are estimated. Lumley delves into this in the appendix, which I'll reproduce here alongside questions I have that I'll look to return to in the future.
We'll define the subpopulation estimate of interest using the indicator function \(Y_i = X_i I(Z_i > 0)\), where \(I(Z_i > 0) = 1\) for members of the subpopulation and 0 otherwise, and \(X_i\) is the measurement of interest.
The variance estimate using replicate weights can be calculated similarly to the typical variance estimate; the replicate weights belonging to sample observations outside the subpopulation are simply not used - this again highlights the utility of replicate weights.
For linearization - that is, using a Taylor series to estimate the variance - the calculation becomes more complicated. Following Lumley's appendix, the Horvitz-Thompson variance estimator is defined as:
\[ \hat{V}[\hat{T}_Y] = \sum_{i,j} \left(\frac{Y_i Y_j}{\pi_{ij}} - \frac{Y_i}{\pi_i} \frac{Y_j}{\pi_j} \right) \\ = \sum_{Z_i,Z_j > 0} \left(\frac{Y_i Y_j}{\pi_{ij}} - \frac{Y_i}{\pi_i} \frac{Y_j}{\pi_j} \right). \]
Here however, Lumley states
but the simplified computational formulas for special designs are not the same.
which I don't completely understand. I suppose he means for clustered or multiphase designs, but it isn't clear, as he goes on to say
for example, the formula for the variance of a total under simple random sampling (equation 2.2)
\[ V[\hat{T}_X] = \frac{N-n}{N} \times N^2 \times \frac{V[X]}{n} \]
cannot be replaced by
\[ V[\hat{T}_Y] \stackrel{?}{=} \frac{N-n}{N} \times N^2 \times \frac{V[X]}{n} \]
or even, defining \(n_D\) as the number sampled in the subpopulation, by
\[ \stackrel{?}{=} \frac{N-n_D}{N} \times N^2 \times \frac{V[X]}{n_D} \]
In order to use these simplified formulas it is necessary to work with the variable \(Y\) and use
\[ V[\hat{T}_Y] = \frac{N-n}{N} \times N^2 \times \frac{V[Y]}{n} \]
It's not clear in this last expression if we're simply back to the initial expression that couldn't be used, or if we're using the smaller sample subset again for variance computations, but Lumley's next text suggests that's the case:
Operationally, this means that variance estimation in a subset of a survey design object in R needs to involve the \(n - n_D\) zero contributions to an estimation equation.
I hope to shed more light on what's going on here in the future, but for now it's clear why this is in the appendix, though not exactly clear to me why observations outside the subpopulation are simply zeroed in the variance computation.
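To convince myself about the zero-contribution point, here is a toy check under simple random sampling, entirely my own simulation: define \(Y_i = X_i I(Z_i > 0)\), estimate the subpopulation total from repeated samples of size \(n\), and compare the empirical variance against the SRS formula applied to \(Y\) over the full sample size \(n\), zeros included.

```r
set.seed(2)
N <- 5000
n <- 500
x <- rgamma(N, 2, 1)          # measurement of interest
z <- rbinom(N, 1, 0.3)        # subpopulation membership indicator
y <- x * (z > 0)              # zero outside the subpopulation

# Repeated SRS estimates of the subpopulation total
est <- replicate(4000, {
  s <- sample(N, n)
  N * mean(y[s])
})

# SRS variance formula applied to Y with the full n (equation 2.2 form)
theoretical <- (N - n) / N * N^2 * var(y) / n

c(empirical = var(est), theoretical = theoretical)
```

The two agree up to Monte Carlo error, which is consistent with Lumley's point that the simplified formula works when applied to \(Y\) with the full sample size rather than to \(X\) with \(n_D\).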
Lumley uses the following three function calls to illustrate three different ways to estimate a ratio: (1) a call to svymean(), (2) a call to svyratio() and (3) a combination of svytotal() and svycontrast().
svymean(~bmi_p, subset(chis, srsex == "MALE" & racehpr == "AFRICAN AMERICAN"))
## mean SE
## bmi_p 28.019 0.2663
chis <- update(chis,
is_aamale = (srsex == "MALE" & racehpr == "AFRICAN AMERICAN")
)
svyratio(~ I(is_aamale * bmi_p), ~is_aamale, chis)
## Ratio estimator: svyratio.svyrep.design(~I(is_aamale * bmi_p), ~is_aamale, chis)
## Ratios=
## is_aamale
## I(is_aamale * bmi_p) 28.01857
## SEs=
## [,1]
## [1,] 0.266306
totals <- svytotal(~ I(bmi_p * is_aamale) + is_aamale, chis)
totals
## total SE
## I(bmi_p * is_aamale) 19348927 260401.7
## is_aamaleFALSE 25697039 6432.3
## is_aamaleTRUE 690575 6341.9
svycontrast(
totals,
quote(`I(bmi_p * is_aamale)` / `is_aamaleTRUE`)
)
## nlcon SE
## contrast 28.019 0.2663
The third use Lumley lists for the use of ratio estimators is to
construct more accurate estimates of population means or totals. His
motivating example is to take the ratio estimate of individuals who took
the API tests and then use that to determine the approximate number of
students who took the test, by multiplying the ratio estimate by the
number of students. This can be done by hand or via the survey package
predict() function used in conjunction with
svyratio().
r <- svyratio(~api.stu, ~enroll, strat_design)
predict(r, total = sum(apipop$enroll, na.rm = TRUE))
## $total
## enroll
## api.stu 3190038
##
## $se
## enroll
## api.stu 29565.98
Lumley uses this as a jumping-off point to discuss linear regression, since one can imagine the number of students taking the API test as being roughly proportional to the number of students enrolled at the schools: \(\text{tests}_i = \alpha \times \text{enrollment}_i + \epsilon_i\) where \(E[\epsilon_i] = 0\) (note that this is an assumption about a moment of the error distribution and not the shape of the error distribution itself).
A quick review of the moment assumptions for linear regression - which are all that are needed for design-based estimation.
If we have random response variable \(Y\) and explanatory variable \(X\) then linear regression is looking at the relationship between the expectation of \(Y\) and \(X\): \[ E[Y] = \alpha + X \beta, \]
Where \(\alpha\) is a constant offset, also called the intercept and \(\beta\) is the slope coefficient that describes the change in \(E[Y]\) per unit change in \(X\). Here I’m referring to \(X\) and \(\beta\) as singular variables but they could also be a matrix and vector of explanatory variables and slope coefficients, respectively. The variance of the response variable, \(Y\) is assumed to be constant, i.e. \(V[Y] = \sigma^2\), unless otherwise modeled.
Following the standard OLS estimation procedure, we’d normally minimize the squared error and Lumley notes that if we have the complete population data, we’d be finished at that:
\[ RSS = \sum_{i=1}^{N} (Y_i - \alpha - X_i\beta)^{2}. \]
Given that we’re typically dealing with (complex) samples though, we have to adjust the estimates to account for the weighting:
\[ \hat{RSS} = \sum_{i=1}^{n} \frac{1}{\pi_i}(Y_i - \alpha - X_i \beta)^2, \]
so each error term is up-weighted according to the inverse of its sampling probability.
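Since the weighted RSS above is just inverse-probability-weighted least squares, the point estimates can be reproduced with plain lm() and weights set to \(1/\pi_i\). The data and inclusion probabilities below are made up for illustration; svyglm() is still needed in practice for design-based standard errors.

```r
set.seed(5)
n <- 400
x <- runif(n)
y <- 1 + 2 * x + rnorm(n)
pi_i <- runif(n, 0.1, 0.9)      # pretend inclusion probabilities

# Minimizing sum( (1/pi_i) * (y - a - b*x)^2 ) is weighted least squares
fit <- lm(y ~ x, weights = 1 / pi_i)
coef(fit)
```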
Lumley's ratio estimator of a population total described previously derives from the linear regression model with a single predictor and no intercept. The separate ratio estimator, similar to the cell-means model, estimates a ratio for each stratum, and the estimate of the total is the sum over strata of each stratum's estimated ratio multiplied by the stratum's population total of the denominator. Formally,
\[ E[Y_i] = \beta_k \times X_i \times I\{i \in \text{stratum } k\}, \]
where \(\beta_k\) is the ratio for stratum \(k\). This setup can provide more precise estimates than the single ratio estimator when the sample size is large and the strata help explain the outcome variable. However, if the strata have no correlation with the outcome, then the standard errors increase, due to the need to estimate the extra parameters.
THIS EXAMPLE DOESN'T MAKE COMPLETE SENSE. Lumley illustrates this latter phenomenon with the California school data - using the percentage of English language learners as a predictor of the overall number of students taking the API tests.
sep <- svyratio(~api.stu, ~enroll, strat_design, separate = TRUE)
com <- svyratio(~api.stu, ~enroll, strat_design)
stratum_totals <- list(E = 1877350, H = 1013824, M = 920298)
predict(sep, total = stratum_totals)
## $total
## enroll
## api.stu 3190022
##
## $se
## enroll
## api.stu 29756.44
predict(com, total = sum(unlist(stratum_totals)))
## $total
## enroll
## api.stu 3190038
##
## $se
## enroll
## api.stu 29565.98
We see the common ratio has a smaller standard error than the separate ratio estimator.
svyby(~api.stu, ~stype, design = dstrata, denom = ~enroll, svyratio)
## stype api.stu/enroll se.api.stu/enroll
## E E 0.8558562 0.006034685
## H H 0.7543378 0.031470156
## M M 0.8331047 0.017694634
Incomes in Scotland
Lumley works through an example looking at household incomes across Scotland from their national household survey. Unfortunately, both the dataset subset he provides at his website and the full dataset he links to lack the variables he uses in his example code in Figure 5.7. For example, the code he provides filters on an ADULTH variable which isn't found in either dataset.
load("Data/SHS/shs.rda") # Lumley's website data
colnames(shs$variables)
## [1] "psu" "uniqid" "ind_wt" "shs_6cla" "council" "rc5"
## [7] "rc7e" "rc7g" "intuse" "groupinc" "clust" "stratum"
## [13] "age" "sex" "emp_sta" "grosswt" "groc"
load("Data/SHS/ex2.RData") # PEAS "full Data" website
colnames(shs)
## [1] "PSU" "UNIQID" "IND_WT" "SHS_6CLA" "COUNCIL" "RC5"
## [7] "RC7E" "RC7G" "INTUSE" "GROUPINC" "CLUST" "STRATUM"
## [13] "AGE" "SEX" "EMP_STA" "GROSSWT" "GROC"
Consequently, I won't reproduce this example here, except to say that he uses it to illustrate how oversampling certain strata (poorer households) can improve the precision of the household income estimate, and that incorporating population information via linear regression increased the precision of the weekly household income estimate for those sub-populations with available population information.
US Elections
Similarly, it doesn't look like the dataset included in the current survey R package has data for the 2008 election - I only see Bush / Kerry / Nader vote totals.
data(election)
colnames(election)
## [1] "County" "TotPrecincts" "PrecinctsReporting"
## [4] "Bush" "Kerry" "Nader"
## [7] "votes" "p"
Consequently I can't do the analysis he shows predicting 2008 votes using 2000 votes. It'll hopefully suffice to say, in theme with the content of the chapter, that because these values are correlated — 2000 vote % for the Republican candidate and 2008 vote % for the Republican candidate — using the earlier values as auxiliary information stands to reduce the variance of the resulting estimate relative to using the 2008 data alone.
Lumley describes three categories of reasons a predictor might be included in a regression, noting that including such predictors may help the model fit better and thus reduce bias that results from, say, non-response in a probability sample.
Exposure of interest: If we're interested in a specific variable's impact on the outcome, it makes sense to include it in the model to estimate that relationship.
Confounding variables: A variable may not be of primary interest, but may be associated with both the outcome variable and exposure of interest. Consequently, this will need to be adjusted for in order to isolate the effect of interest.
Precision variables: These are, again, associated with the outcome variable of interest, but not associated with the exposure of interest. However, because of their association alone, they can increase the precision with which the exposure effect is estimated.
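A small simulated illustration of the precision-variable idea, with variable names of my own: z is associated with the outcome but independent of the exposure x, and adjusting for it shrinks the standard error on the exposure coefficient without changing what is being estimated.

```r
set.seed(8)
n <- 1000
x <- rnorm(n)                     # exposure of interest
z <- rnorm(n)                     # precision variable, independent of x
y <- 0.5 * x + 2 * z + rnorm(n)   # z explains much of the outcome

se_x <- function(fit) summary(fit)$coefficients["x", "Std. Error"]

ses <- c(
  unadjusted = se_x(lm(y ~ x)),
  adjusted   = se_x(lm(y ~ x + z))
)
round(ses, 3)   # the adjusted SE is markedly smaller
```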
Lumley goes on to describe methods for model selection which I’ll leave for the text.
Example: Dietary sodium and potassium and blood pressure
demo <- haven::read_xpt("data/nhanesxpt/demo_c.xpt")[, c(1:8, 28:31)]
bp <- haven::read_xpt("data/nhanesxpt/bpx_c.xpt")
bm <- haven::read_xpt("data/nhanesxpt/bmx_c.xpt")[, c("SEQN", "BMXBMI")]
diet <- haven::read_xpt("data/nhanesxpt/dr1tot_c.xpt")[, c(1:52, 63, 64)]
nhanes34 <- merge(demo, bp, by = "SEQN")
nhanes34 <- merge(nhanes34, bm, by = "SEQN")
nhanes34 <- merge(nhanes34, diet, by = "SEQN")
demo5 <- haven::read_xpt("data/nhanesxpt/demo_d.xpt")[, c(1:8, 39:42)]
bp5 <- haven::read_xpt("data/nhanesxpt/bpx_d.xpt")
bp5$BPXSAR <- rowMeans(bp5[, c("BPXSY1", "BPXSY2", "BPXSY3", "BPXSY4")],
na.rm = TRUE
)
bp5$BPXDAR <- rowMeans(bp5[, c("BPXDI1", "BPXDI2", "BPXDI3", "BPXDI4")],
na.rm = TRUE
)
bm5 <- haven::read_xpt("data/nhanesxpt/bmx_d.xpt")[, c("SEQN", "BMXBMI")]
diet5 <- haven::read_xpt("data/nhanesxpt/dr1tot_d.xpt")[, c(1:52, 64, 65)]
nhanes56 <- merge(demo5, bp5, by = "SEQN")
nhanes56 <- merge(nhanes56, bm5, by = "SEQN")
nhanes56 <- merge(nhanes56, diet5, by = "SEQN")
nhanes <- rbind(nhanes34, nhanes56)
nhanes$fouryearwt <- nhanes$WTDRD1 / 2
# I added the two lines below to make graphing the
# smooth plot easier
nhanes$sodium <- nhanes$DR1TSODI / 1000
nhanes$potassium <- nhanes$DR1TPOTA / 1000
des <- svydesign(
id = ~SDMVPSU, strat = ~SDMVSTRA, weights = ~fouryearwt,
nest = TRUE, data = subset(nhanes, !is.na(WTDRD1))
)
des <- update(des, sodium = DR1TSODI / 1000, potassium = DR1TPOTA / 1000)
des
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (60) clusters.
## update(des, sodium = DR1TSODI/1000, potassium = DR1TPOTA/1000)
Lumley uses the following plot — examining systolic blood pressure as a function of daily sodium intake — to motivate the need to adjust for confounders. As we can see below in the reproduced Figure 5.10, there doesn't appear to be much of a relationship between sodium intake and average blood pressure. Lumley argues that we observe this flat relationship because the association between sodium and blood pressure is confounded by age.
plot(BPXSAR ~ sodium, data = nhanes, type = "n")
points(svyplot(BPXSAR ~ sodium,
design = des, style = "transparent", xlab =
"Dietary Sodium (g/day)", ylab = "Systolic Blood Pressure (mm Hg)"
))
lines(svysmooth(BPXSAR ~ sodium, des))
To test this hypothesis, Lumley first visualizes the three variables
using the conditional plot demonstrated previously.
svycoplot(BPXSAR ~ sodium | equal.count(RIDAGEYR), des,
style = "hexbin",
xlab = "Dietary Sodium (g/day)",
ylab = "Systolic BP (mmHg)",
strip = strip.custom(var.name = "Age")
)
In the above plot, we see a greater indication that as dietary sodium increases, so too does systolic blood pressure.
To more formally test the hypothesis, Lumley fits several models with these variables included. The first two just include (1) sodium and potassium and (2) sodium, potassium and Age.
model0 <- svyglm(BPXSAR ~ sodium + potassium, design = des)
summary(model0)
##
## Call:
## svyglm(formula = BPXSAR ~ sodium + potassium, design = des)
##
## Survey design:
## update(des, sodium = DR1TSODI/1000, potassium = DR1TPOTA/1000)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 120.3899 0.7105 169.436 < 2e-16 ***
## sodium -0.6907 0.1658 -4.166 0.000268 ***
## potassium 0.7750 0.2655 2.919 0.006853 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 382.6159)
##
## Number of Fisher Scoring iterations: 2
model1 <- svyglm(BPXSAR ~ sodium + potassium + RIDAGEYR, design = des)
summary(model1)
##
## Call:
## svyglm(formula = BPXSAR ~ sodium + potassium + RIDAGEYR, design = des)
##
## Survey design:
## update(des, sodium = DR1TSODI/1000, potassium = DR1TPOTA/1000)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 99.73535 0.79568 125.346 < 2e-16 ***
## sodium 0.79846 0.14866 5.371 1.13e-05 ***
## potassium -0.91148 0.18994 -4.799 5.23e-05 ***
## RIDAGEYR 0.49561 0.01169 42.404 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 275.4651)
##
## Number of Fisher Scoring iterations: 2
As we can see, the sodium and potassium coefficient signs in the first model switch direction once age is included in the second model, demonstrating that the two variables are associated with both age and systolic blood pressure. The second model makes more sense intuitively, because we expect systolic blood pressure to increase, on average, as a function of sodium intake.
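The sign flip is a classic symptom of confounding. As an illustration independent of the survey machinery, here is a small base-R simulation with made-up effect sizes, in which age suppresses sodium intake but raises blood pressure:

```r
set.seed(1)
n <- 20000
age <- runif(n, 20, 80)

# Hypothetical relationships: older people consume somewhat less sodium,
# while blood pressure rises with both sodium and age.
sodium <- 4 - 0.03 * age + rnorm(n, sd = 0.5)
bp <- 90 + 0.8 * sodium + 0.5 * age + rnorm(n, sd = 10)

unadjusted <- coef(lm(bp ~ sodium))[["sodium"]]        # biased negative
adjusted   <- coef(lm(bp ~ sodium + age))[["sodium"]]  # near the true 0.8
round(c(unadjusted = unadjusted, adjusted = adjusted), 2)
```

The unadjusted slope comes out strongly negative even though the structural sodium effect is +0.8, mirroring the flip between `model0` and `model1`.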
Lumley adds a few more possible confounders to model2
and then tests to see whether the effects of daily dietary sodium and
potassium on systolic blood pressure are significantly different than
zero using the regTermTest() function.
model2 <- svyglm(BPXSAR ~ sodium + potassium + RIDAGEYR + RIAGENDR + BMXBMI,
design = des
)
summary(model2)
##
## Call:
## svyglm(formula = BPXSAR ~ sodium + potassium + RIDAGEYR + RIAGENDR +
## BMXBMI, design = des)
##
## Survey design:
## update(des, sodium = DR1TSODI/1000, potassium = DR1TPOTA/1000)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 97.32903 1.42668 68.220 < 2e-16 ***
## sodium 0.43458 0.16164 2.689 0.0126 *
## potassium -0.96119 0.17043 -5.640 7.19e-06 ***
## RIDAGEYR 0.45791 0.01080 42.380 < 2e-16 ***
## RIAGENDR -3.38208 0.38403 -8.807 3.90e-09 ***
## BMXBMI 0.38460 0.03797 10.129 2.48e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 263.5461)
##
## Number of Fisher Scoring iterations: 2
regTermTest(model2, ~ potassium + sodium, df = NULL)
## Wald test for potassium sodium
## in svyglm(formula = BPXSAR ~ sodium + potassium + RIDAGEYR + RIAGENDR +
## BMXBMI, design = des)
## F = 15.98481 on 2 and 25 df: p= 3.3784e-05
The test formally examines whether a model with these terms is more likely, given the data, than one without them, with the null hypothesis being that the extra terms are unnecessary. As we can see, the observed fit would be very unlikely if the two extra terms were unneeded, so there's evidence to support an association between these terms and blood pressure.
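Under the hood, `regTermTest()` builds a Wald statistic from the fitted coefficients and their estimated covariance: for the tested subvector \(\hat\beta_q\), \(F = \hat\beta_q^\top \hat{V}^{-1} \hat\beta_q / q\). Here is a sketch of that computation with base R's `lm()` and the built-in `mtcars` data; for `svyglm` the only differences are that `vcov()` returns a design-based covariance matrix and the denominator degrees of freedom come from the design.

```r
fit  <- lm(mpg ~ wt + hp + qsec, data = mtcars)
keep <- c("wt", "hp")                      # terms under test

b <- coef(fit)[keep]
V <- vcov(fit)[keep, keep]

# Wald F statistic for H0: both coefficients are zero
Fstat <- drop(t(b) %*% solve(V) %*% b) / length(keep)
pval  <- pf(Fstat, df1 = length(keep), df2 = df.residual(fit),
            lower.tail = FALSE)

# For ordinary least squares this matches the nested-model F test exactly.
reduced <- lm(mpg ~ qsec, data = mtcars)
all.equal(Fstat, anova(reduced, fit)$F[2])
```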
Lumley digs into why the effect is so small — 1 gram of sodium (2.5 grams of salt) is a lot of salt for an increase in systolic blood pressure of "only" 0.43 mmHg. Some explanations include measurement error, missing data, and model misspecification. Lumley examines the last of these by displaying the model diagnostic plots: misspecification can be detected by looking for an association between the partial residuals and observed sodium intake. I replicate Lumley's code below.
par(mfrow = c(1, 2))
plot(as.vector(predict(model1)), resid(model1),
xlab = "Fitted Values", ylab = "Residuals"
)
nonmissing <- des[-model1$na.action]
plot(nonmissing$variables$sodium,
resid(model1, "partial")[, 1],
xlab = "Sodium",
ylab = "Partial Residuals"
)
nonmissing <- des[-model1$na.action]
par(mfrow = c(1, 2))
plot(model1, panel = make.panel.svysmooth(nonmissing))
termplot(model1,
data = model.frame(nonmissing),
partial = TRUE, se = TRUE, smooth = make.panel.svysmooth(nonmissing)
)
int1 <- svyglm(BPXSAR ~ (sodium + potassium) * I(RIDAGEYR - 40) + RIAGENDR + BMXBMI,
design = des
)
summary(int1)
##
## Call:
## svyglm(formula = BPXSAR ~ (sodium + potassium) * I(RIDAGEYR -
## 40) + RIAGENDR + BMXBMI, design = des)
##
## Survey design:
## update(des, sodium = DR1TSODI/1000, potassium = DR1TPOTA/1000)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 116.215191 1.240298 93.699 < 2e-16 ***
## sodium 0.309581 0.165746 1.868 0.074583 .
## potassium -0.975994 0.182598 -5.345 1.99e-05 ***
## I(RIDAGEYR - 40) 0.605278 0.022047 27.455 < 2e-16 ***
## RIAGENDR -3.516219 0.377502 -9.314 2.87e-09 ***
## BMXBMI 0.387957 0.038347 10.117 6.14e-10 ***
## sodium:I(RIDAGEYR - 40) -0.015707 0.008767 -1.792 0.086356 .
## potassium:I(RIDAGEYR - 40) -0.039575 0.010121 -3.910 0.000703 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 261.658)
##
## Number of Fisher Scoring iterations: 2
Lumley caps off this chapter by asking whether we even need to bother with the specialized weighting built into the survey package. His answer is worth digging into:
Since regression models use adjustment for confounders as a way of removing distorted associations between exposure and response, it is plausible that a regression model might not need sampling weights.
A key question here is whether the estimand is stable across the populations represented by our data set: the biased population our sample represents when unweighted, and the target population it represents when re-weighted. A follow-up question is whether we could even estimate the relationship as desired if the effect is heterogeneous across populations. Lumley cites (DuMouchel and Duncan 1983) for further discussion on this topic.
I have more to say here — thinking of Gelman's "Struggles with Survey Weighting and Regression Modeling" (Gelman 2007) — but it may not fit in these notes. For now I'll end with Lumley's two limitations for regression models when not using weights:
Some important variables used in constructing the weights may not be available,
Even when available, those variables may not be suitable for inclusion in the model.
Lumley urges caution in this regard, advising that even a small amount of bias introduced by omitting the weights can make any potential increase in precision a poor trade-off.
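A sketch of that trade-off with simulated data (all numbers invented): the population contains two groups with different slopes, and the sample deliberately over-represents one of them. The unweighted fit drifts toward the over-sampled group's slope, while the weighted fit recovers the census slope. Note that `lm()`'s `weights` argument reproduces the survey-weighted point estimate, though not its design-based standard error.

```r
set.seed(2)
N <- 100000
x <- rnorm(N)
g <- rbinom(N, 1, 0.5)                 # two subpopulations
y <- 1 + (1 + g) * x + rnorm(N)        # slope 1 in group 0, slope 2 in group 1

census <- coef(lm(y ~ x))[["x"]]       # population-level slope, about 1.5

# Group 1 is sampled nine times as often as group 0
p <- ifelse(g == 1, 0.009, 0.001)
s <- rbinom(N, 1, p) == 1
w <- 1 / p[s]

unweighted <- coef(lm(y[s] ~ x[s]))[[2]]                # pulled toward 2
weighted   <- coef(lm(y[s] ~ x[s], weights = w))[[2]]   # near the census slope
```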
county_sample <- wa_crime_df %>%
distinct(County) %>%
slice_sample(n = 10) %>%
pull(County)
wa_crime_df %>%
filter(County %in% county_sample) %>%
as_survey_design() %>%
summarize(
crime = survey_total(murder_and_crime)
)
## # A tibble: 1 × 2
## crime crime_se
## <dbl> <dbl>
## 1 8644 2238.
ratio_estimate <- wa_crime_df %>%
filter(County %in% county_sample) %>%
as_survey_design() %>%
svyratio(~murder_and_crime, ~Population, design = .)
predict(ratio_estimate, total = sum(wa_crime_df$Population))
## $total
## Population
## murder_and_crime 56678.22
##
## $se
## Population
## murder_and_crime 5888.009
wa_crime_03_df <- readxl::read_xlsx("data/WA_crime/1984-2011.xlsx", skip = 4) %>%
filter(Year == "2003", Population > 0) %>%
mutate(
murder_and_crime = `Murder Total` + `Burglary Total`,
state_pop = sum(Population),
County = stringr::str_to_lower(County),
num_counties = n_distinct(County),
) %>%
group_by(County) %>%
mutate(num_agencies = n_distinct(Agency)) %>%
ungroup() %>%
select(
County, Agency, Population, murder_and_crime,
num_counties, num_agencies
)
model_df <- wa_crime_03_df %>%
mutate(year = "2003") %>%
bind_rows(wa_crime_df %>% mutate(year = "2004")) %>%
mutate(
# We'll use 2004's numbers here for the fpc.
num_counties = if_else(year == "2004", num_counties, 0),
num_agencies = if_else(year == "2004", num_agencies, 0),
) %>%
spread(year, murder_and_crime) %>%
group_by(County, Agency) %>%
summarize(
num_counties = sum(num_counties),
num_agencies = sum(num_agencies),
`2003` = sum(replace_na(`2003`, 0)),
`2004` = sum(replace_na(`2004`, 0))
) %>%
ungroup() %>%
filter(
# Agencies were removed between 2003 and 2004.
num_counties > 0, num_agencies > 0,
County %in% county_sample
)
model_design <- model_df %>%
as_survey_design(
id = c(County, Agency),
fpc = c(num_counties, num_agencies)
)
fit <- svyglm(`2004` ~ `2003`, design = model_design)
total_matrix <- c(sum(wa_crime_03_df$murder_and_crime))
total_matrix <- as.data.frame(total_matrix)
names(total_matrix) <- "2003"
predict(fit, newdata = total_matrix)
## link SE
## 1 52874 1138.2
We can't use a ratio estimator here because we're not using the total population of the year in question as the denominator: we only relate the total number of murders and burglaries in the previous year to the next year's total, without explicitly measuring the population.
smaller_county_sample <- wa_crime_df %>%
distinct(County) %>%
filter(County != "king") %>%
slice_sample(n = 5) %>%
pull(County)
county_list <- unique(wa_crime_df$County)
design <- wa_crime_df %>%
filter(County == "king" | County %in% smaller_county_sample) %>%
mutate(
strata_label = if_else(County == "king", "King County", "WA Counties"),
num_counties = if_else(County == "king", 1, length(county_list) - 1)
) %>%
as_survey_design(
ids = c(County, Agency),
fpc = c(num_counties, num_agencies),
strata = strata_label
)
strata_totals <- wa_crime_df %>%
mutate(strata = if_else(County == "king", "King", "WA Counties")) %>%
group_by(strata) %>%
summarize(Population = sum(Population)) %>%
spread(strata, Population) %>%
as.matrix()
separate_estimator <- svyratio(~murder_and_crime, ~Population, design,
separate = TRUE
)
common_estimator <- svyratio(~murder_and_crime, ~Population, design)
predict(separate_estimator, total = strata_totals)
## $total
## Population
## murder_and_crime 68697.7
##
## $se
## Population
## murder_and_crime 2141.227
predict(common_estimator, total = sum(wa_crime_df$Population))
## $total
## Population
## murder_and_crime 68915.53
##
## $se
## Population
## murder_and_crime 3370.313
king_districts <- wa_crime_df %>%
filter(County == "king") %>%
pull(Agency)
sampled_king_districts <- sample(king_districts, 5)
sampled_counties <- sample(county_list, 5)
design <- wa_crime_df %>%
filter(County %in% sampled_counties | Agency %in% sampled_king_districts) %>%
mutate(
strata_label = if_else(County == "king", "King County", "WA Counties"),
num_counties = if_else(County == "king", 1, length(county_list) - 1),
) %>%
as_survey_design(
id = c(County, Agency),
fpc = c(num_counties, num_agencies),
strata = strata_label
)
separate_estimator <- svyratio(~murder_and_crime, ~Population, design,
separate = TRUE
)
common_estimator <- svyratio(~murder_and_crime, ~Population, design)
predict(separate_estimator, total = strata_totals)
## $total
## Population
## murder_and_crime 61185.56
##
## $se
## Population
## murder_and_crime 5576.053
predict(common_estimator, total = sum(wa_crime_df$Population))
## $total
## Population
## murder_and_crime 61727.29
##
## $se
## Population
## murder_and_crime 8328.097
sampled_king_districts <- sample(king_districts, 5)
sampled_counties <- sample(county_list, 5)
wa_crime_df %>%
filter(County %in% sampled_counties | Agency %in% sampled_king_districts) %>%
mutate(
strata_label = if_else(County == "king", "King County", "WA Counties"),
num_counties = if_else(County == "king", 1, length(county_list) - 1),
) %>%
as_survey_design(
id = c(County, Agency),
fpc = c(num_counties, num_agencies),
strata = strata_label
) %>%
summarize(
violent_non_violent = survey_ratio(violent_crime, property_crime)
)
## # A tibble: 1 × 2
## violent_non_violent violent_non_violent_se
## <dbl> <dbl>
## 1 0.0608 0.00749
round(sum(wa_crime_df$violent_crime) / sum(wa_crime_df$property_crime), 2)
## [1] 0.07
We can see that the estimate is quite close to the population value.
The next exercise asks us to estimate the ratio of total monthly rent (tmthrnt) and total household income (thtrninc) over the whole population and over the subpopulation who pay rent. I think Lumley may have an error here when he says that thtrninc is the total household monthly income: earlier we used thtotinc for this measure, as he had in creating Figure 3.9. Consequently, I use thtotinc below.
sipp_hh_sub %>%
# Total population
summarize(
ratio_of_monthly_rent_to_household_income = survey_ratio(tmthrnt, thtotinc)
)
## # A tibble: 1 × 2
## ratio_of_monthly_rent_to_household_income ratio_of_monthly_rent_to_household…¹
## <dbl> <dbl>
## 1 0.00236 0.0000800
## # ℹ abbreviated name: ¹ratio_of_monthly_rent_to_household_income_se
sipp_hh_sub %>%
filter(tmthrnt > 0) %>%
# Rent paying subpopulation
summarize(
ratio_of_monthly_rent_to_household_income = survey_ratio(tmthrnt, thtotinc)
)
## # A tibble: 1 × 2
## ratio_of_monthly_rent_to_household_income ratio_of_monthly_rent_to_household…¹
## <dbl> <dbl>
## 1 0.209 0.00506
## # ℹ abbreviated name: ¹ratio_of_monthly_rent_to_household_income_se
# Full Population
sipp_hh_sub %>%
mutate(
# I also ran the numbers if we excluded those with 0 household rent
# and the estimates are effectively the same.
prop_income_rent = if_else(thtotinc == 0, 0, (tmthrnt / thtotinc)),
) %>%
summarize(
prop_income_rent_est = survey_mean(prop_income_rent)
)
## # A tibble: 1 × 2
## prop_income_rent_est prop_income_rent_est_se
## <dbl> <dbl>
## 1 0.0221 0.00624
sipp_hh_sub %>%
# Rent paying subpopulation
filter(tmthrnt > 0) %>%
mutate(
# I also ran the numbers if we excluded those with 0 household rent
# and the estimates are effectively the same.
prop_income_rent = if_else(thtotinc == 0, 0, (tmthrnt / thtotinc)),
) %>%
summarize(
prop_income_rent_est = survey_mean(prop_income_rent)
)
## # A tibble: 1 × 2
## prop_income_rent_est prop_income_rent_est_se
## <dbl> <dbl>
## 1 0.548 0.154
What are we to make of these estimates being different? They're estimating two different things. As Lumley points out at the start of the chapter, one is a ratio of two population-level totals, the other is the population estimate of a ratio measured at the individual level.
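A toy example (invented numbers) makes the distinction concrete: three households with monthly rents and incomes give different answers depending on whether we take the ratio of totals or the mean of the household-level ratios.

```r
rent   <- c(0, 500, 1000)
income <- c(4000, 2000, 2500)

ratio_of_totals <- sum(rent) / sum(income)  # 1500 / 8500, about 0.18
mean_of_ratios  <- mean(rent / income)      # mean of 0, 0.25, 0.40, about 0.22
```

The first estimand answers "what share of all income goes to rent?"; the second answers "what share of its income does a typical household spend on rent?". The gap widens when, as in the SIPP data, many households pay no rent at all.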
The next exercise asks whether the percentage of teachers with only emergency qualifications (emer) affects academic performance (as measured by the 2000 API). Going off the web documentation of the api dataset from the survey package, it looks like there are a number of possible confounding variables that should be included in the model. Here is a list with a brief explanation:
* meals: the percent of students eligible for subsidized meals. This is a proxy for poverty and is likely correlated with the academic achievement measured by the test.
* hsg: the percent of parents who are high school graduates. Parental academic achievement is likely associated with their students' academic achievement.
* avg.ed: average parental education level. This might combine the above variable along with those who have college or post-graduate education.
* comp.imp: whether the school met its "growth" improvement target, which may be related to students' academic performance.
* acs.k3: average class size in years K-3. Class size is often associated with academic performance; there is a similar acs.46 variable for grades 4-6.
* ell: the percent of English language learners. Since most classes are instructed in English, non-native English speakers may struggle more with academic instruction.
* full: the percent of fully qualified teachers. A fully qualified teacher will presumably be more capable of teaching than one that's not fully qualified.
* enroll: enrollment may also be associated with the API, if larger schools have access to greater resources or, inversely, worse teacher-to-student ratios.
Should the 1999 API be in the model (and why, or why not?)
The 1999 API is likely strongly associated with the emer measure, and may mask that variable's weaker impact, so I'll leave it out in my estimate to try to avoid this problem.
Do any of the confounding variables need to be transformed? Does emer need to be transformed?
It doesn’t look like it needs to be to me. I include two plots below
that visualize the emer distribution (in the target
population) as well as its relationship with the api00
measure. While there is a right skew in the distribution, this doesn’t
strike me as problematic and I think a log transformation - an attempt
to fix the skew - would not be helpful both because of the difficulty in
handling 0 values as well as the log-scale interpretation.
It could be worth centering the emer value at the estimated population mean to make the interpretation of the intercept more valuable, but I don't think that's necessary for an adequate model interpretation here.
svyhist(~emer, strat_design)
svyplot(api00 ~ emer, strat_design,
main = "CA 2000 API Scores vs. Teacher Emergency Training Preparedness",
xlab = "Teachers with emergency training",
ylab = "API 2000"
)
fit <- svyglm(
api00 ~ emer + meals + hsg + avg.ed + comp.imp +
acs.k3 + acs.46 + ell + full + enroll,
design = strat_design
)
summary(fit)
##
## Call:
## svyglm(formula = api00 ~ emer + meals + hsg + avg.ed + comp.imp +
## acs.k3 + acs.46 + ell + full + enroll, design = strat_design)
##
## Survey design:
## svydesign(id = ~1, strata = ~stype, fpc = ~fpc, data = apistrat)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 652.826459 163.333076 3.997 0.000137 ***
## emer -0.726831 1.485803 -0.489 0.625986
## meals -2.378649 0.483856 -4.916 4.32e-06 ***
## hsg 0.424671 0.416138 1.021 0.310419
## avg.ed 48.006215 17.485877 2.745 0.007390 **
## comp.impYes 15.075954 14.397161 1.047 0.298035
## acs.k3 -3.366084 4.984776 -0.675 0.501357
## acs.46 2.989147 1.389796 2.151 0.034367 *
## ell -0.228954 0.407923 -0.561 0.576110
## full 0.006335 1.242588 0.005 0.995944
## enroll -0.042131 0.035810 -1.176 0.242720
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 3905.969)
##
## Number of Fisher Scoring iterations: 2
From the output above, we see that the emer value isn’t
found to be significantly associated (at \(\alpha = 0.05\)) with the 2000 API measure
after adjusting for other variables. Indeed, meals,
avg.ed and acs.46 are the only variables for
which there is evidence to support a relationship at this level.
The next exercise asks us to fit the same model to the full population data (apipop) using the glm() function.
fit_pop <- glm(api00 ~ emer + meals + hsg + avg.ed + comp.imp +
  acs.k3 + acs.46 + ell + full + enroll, data = apipop)
summary(fit_pop)
##
## Call:
## glm(formula = api00 ~ emer + meals + hsg + avg.ed + comp.imp +
## acs.k3 + acs.46 + ell + full + enroll, data = apipop)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 460.895349 21.325334 21.613 < 2e-16 ***
## emer 0.542863 0.171137 3.172 0.00152 **
## meals -2.180465 0.058205 -37.462 < 2e-16 ***
## hsg 0.209642 0.078151 2.683 0.00734 **
## avg.ed 50.288082 2.345033 21.445 < 2e-16 ***
## comp.impYes 30.659677 2.103305 14.577 < 2e-16 ***
## acs.k3 0.816373 0.556165 1.468 0.14222
## acs.46 0.863564 0.268385 3.218 0.00130 **
## ell -0.488302 0.064119 -7.616 3.25e-14 ***
## full 1.560638 0.152078 10.262 < 2e-16 ***
## enroll -0.035925 0.004889 -7.348 2.43e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 2748.517)
##
## Null deviance: 70739140 on 4070 degrees of freedom
## Residual deviance: 11158979 on 4060 degrees of freedom
## (2123 observations deleted due to missingness)
## AIC: 43803
##
## Number of Fisher Scoring iterations: 2
Some of the sample estimates agree with the population data. It is very clear that there is a whole lot more power to detect associations when the entire population is present. In brief, we see that the three variables for which an association was detected previously — meals, acs.46 and avg.ed — all have similar estimates (within the sample-based margin of error) on the full population data. Additionally, many other variables now show significant associations that they did not previously. Notably, the emer variable has a positive association with api00, such that we'd expect a 0.5 gain in API score for every additional percentage point of emergency-qualified teachers.
OneSimulation <- function() {
coefs <- apipop %>%
group_by(stype) %>%
slice_sample(n = 66) %>%
ungroup() %>%
as_survey_design(
strata = stype
) %>%
svyglm(api00 ~ emer + meals + hsg + avg.ed + comp.imp +
acs.k3 + acs.46 + ell + full + enroll, design = .) %>%
coef(.)
return(coefs)
}
coef_dist <- replicate(100, OneSimulation())
par(mfrow = c(1, 2))
hist(coef_dist[1, ], main = "Histogram of Intercept")
hist(coef_dist[2, ], main = "Histogram of emer")
Yes, as we’d expect the sampling distributions are roughly normal.
Given the discussion in the book's example, sodium and potassium intake are likely confounders alongside the usual suspects: age, sex, race and socioeconomic status. That said, I don't know whether the last two of these are available, given that the variable names in the dataset are not particularly descriptive.
See above - race and socioeconomic status stand out as two confounding variables that don’t appear to be measured in this dataset.
I’ll fit the same model as model2 in the text since
Lumley explains what the variables are in that model.
fit <- svyglm(BPXSAR ~ sodium + potassium + RIDAGEYR + RIAGENDR + BMXBMI,
design = des
)
summary(fit)
##
## Call:
## svyglm(formula = BPXSAR ~ sodium + potassium + RIDAGEYR + RIAGENDR +
## BMXBMI, design = des)
##
## Survey design:
## update(des, sodium = DR1TSODI/1000, potassium = DR1TPOTA/1000)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 97.32903 1.42668 68.220 < 2e-16 ***
## sodium 0.43458 0.16164 2.689 0.0126 *
## potassium -0.96119 0.17043 -5.640 7.19e-06 ***
## RIDAGEYR 0.45791 0.01080 42.380 < 2e-16 ***
## RIAGENDR -3.38208 0.38403 -8.807 3.90e-09 ***
## BMXBMI 0.38460 0.03797 10.129 2.48e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 263.5461)
##
## Number of Fisher Scoring iterations: 2
According to this model, there is a 0.38 mm Hg expected increase in systolic blood pressure for every one-unit increase in BMI after adjusting for age, sex, and daily sodium and potassium intake. In other words, we'd expect higher blood pressure amongst those with higher BMIs.
fit <- svyglm(BPXSAR ~ (RIDAGEYR + RIAGENDR) * I(BMXBMI - 25) + sodium + potassium,
design = des
)
summary(fit)
##
## Call:
## svyglm(formula = BPXSAR ~ (RIDAGEYR + RIAGENDR) * I(BMXBMI -
## 25) + sodium + potassium, design = des)
##
## Survey design:
## update(des, sodium = DR1TSODI/1000, potassium = DR1TPOTA/1000)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 107.178639 1.200310 89.292 < 2e-16 ***
## RIDAGEYR 0.462853 0.011268 41.076 < 2e-16 ***
## RIAGENDR -3.485444 0.353566 -9.858 1.00e-09 ***
## I(BMXBMI - 25) 0.510054 0.104374 4.887 6.18e-05 ***
## sodium 0.438810 0.163223 2.688 0.013119 *
## potassium -0.989246 0.169362 -5.841 5.95e-06 ***
## RIDAGEYR:I(BMXBMI - 25) -0.005031 0.001305 -3.855 0.000806 ***
## RIAGENDR:I(BMXBMI - 25) 0.041993 0.059821 0.702 0.489736
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 263.1093)
##
## Number of Fisher Scoring iterations: 2
I center the BMI variable at 25 (roughly the marginal average) and fit the model with age and sex interactions with the centered BMI. The fit shows a significant negative association for the BMI-age interaction and no association for the sex-BMI interaction. The negative interaction corresponds to a dampening effect of age: the older the respondent, the smaller the expected increase in blood pressure for each additional unit of BMI.
The next exercise asks us to prove that if \(\beta^*\) is defined by the census estimating equations
\[ \sum_{i=1}^{N} x_i(y_i - x_i\beta^*) = 0 \]
and \(R_i\) indicates that observation \(i\) is in the sample, then
\[ E \left [ \sum_{i=1}^{N} R_ix_i (y_i - x_i \beta^*)\right ] = 0 \]
so that the unweighted sample estimating equations are unbiased.
TODO(apeterson91)
5.8 A rough approximation to the loss of efficiency from unnecessarily using weights can be constructed by considering the variance of the residuals in weighted and unweighted estimation. Assume as an approximation that the residuals \(r_i\) are independent of the sampling weights \(w_i\).
TODO(apeterson91)
TODO(apeterson91)
Chapter 6 covers regression models for discrete outcome data — binary or categorical data. The survey package also handles Poisson and binomial regression, though neither of these is covered in this chapter.
Lumley gives a brief overview of logistic regression which is roughly equivalent to what can be found on Wikipedia, so I won't reiterate his points here. The main take-home is that while the interpretation of the coefficients changes from a linear association to an odds-ratio association, the only change to svyglm() is adding the family = quasibinomial() option.
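As a reminder of the odds-ratio reading, here is a base-R sketch on simulated (invented) data, without any survey weighting: exponentiating a fitted coefficient gives the multiplicative change in the odds per unit of the covariate.

```r
set.seed(3)
n <- 2000
age <- runif(n, 18, 80)
use <- rbinom(n, 1, plogis(3 - 0.06 * age))  # log odds fall 0.06 per year

fit <- glm(use ~ age, family = binomial())

exp(coef(fit)[["age"]])             # odds ratio per year, near exp(-0.06) ~ 0.94
exp(confint.default(fit)["age", ])  # Wald interval on the odds-ratio scale
```

With svyglm() the call pattern is identical apart from `design =` and `family = quasibinomial()`.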
To demonstrate how to analyze binary data, Lumley uses the Scottish Household Survey data, examining internet use amongst the nation’s populace.
load("Data/SHS/shs.rda") # Lumley's website data
par(mfrow = c(2, 1))
bys <- svyby(~intuse, ~ age + sex, design = shs, svymean)
plot(
svysmooth(intuse ~ age,
design = subset(shs, sex == "male" & !is.na(age)),
bandwidth = 5
),
ylim = c(0, 0.8), ylab = "% Using Internet",
xlab = "Age"
)
lines(svysmooth(intuse ~ age,
design = subset(shs, sex == "female" & !is.na(age)),
bandwidth = 5
), lwd = 2, lty = 3)
points(bys$age, bys$intuse, pch = ifelse(bys$sex == "male", 19, 1))
legend("topright",
pch = c(19, 1), lty = c(1, 3), lwd = c(1, 2),
legend = c("Male", "Female"), bty = "n"
)
byinc <- svyby(~intuse, ~ sex + groupinc, design = shs, svymean)
barplot(byinc, xlab = "Income", ylab = "% Using Internet")
Since binary data can't be easily visualized across a continuous variable, the plots above use both smoothing and binning — computing proportion point estimates within ranges of age — to show how internet use changes across these continuous dimensions.
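The binning half of that strategy is easy to mimic in base R: cut the continuous variable into intervals and take the proportion within each (unweighted here; svyby() with svymean handles the weighted version, as in the code above).

```r
set.seed(4)
age <- runif(3000, 16, 90)
intuse <- rbinom(3000, 1, plogis(4 - 0.08 * age))  # invented usage curve

bins <- cut(age, breaks = seq(15, 90, by = 15))
prop_by_bin <- tapply(intuse, bins, mean)  # proportion of users per age band
round(prop_by_bin, 2)
```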
We can see in the plot that internet use is lower amongst the older respondents in the survey and men ten to use the internet more then women, except perhaps the very youngest. My phrasing here is intentional, to demonstrate that the phenomenon we’re observing is likely a cohort effect since, as Lumley notes, its likely that born before the arrival of the internet are less likely to use it.
We see this same pattern in the boxplot showing internet use across income; men using internet more than women. We also see that those with higher incomes tend to use the internet more than those with lower incomes. In contrast to the cohort effect seen above, Lumley suggests that income may be more of a real effect, and those who start earning more may be more likely to use the internet.
Lumley then fits a series of models to quantify the relationships we’re seeing in the plots formally. I’ll summarize these briefly here and fit them below.
Model 1 estimates the log odds of internet use as a linear function (on the log-odds scale) of age, sex and their interaction.
Model 2 is the same as Model 1 except with two slopes for those younger and older than age 35, respectively — a low dimensional way to account for the nonlinear shape we observed previously.
Model 3 is the same as Model 2 but adds an additional fixed effect term for income.
Model 4 now adds an interaction between income and sex to account for the differences.
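The pmin()/pmax() trick in Model 2 works because pmin(age, 35) varies only below 35 and pmax(age, 35) varies only above it (their sum is always age + 35), so the two coefficients act as separate slopes joined at the knot. A quick check on fabricated broken-line data:

```r
set.seed(5)
a <- runif(500, 18, 80)
# True broken line: slope 2 below age 35, slope -0.5 above, continuous at 35
y <- ifelse(a < 35, 2 * a, 70 - 0.5 * (a - 35)) + rnorm(500, sd = 0.1)

fit <- lm(y ~ pmin(a, 35) + pmax(a, 35))
round(coef(fit)[-1], 2)  # recovers roughly 2 and -0.5
```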
Lumley examines the output of all the models, and for models 2 and 3 computes what the two age slopes for women are by combining the main-effect and interaction terms with svycontrast(). Similarly, with model 4 Lumley tests whether the 5 additional parameters added as the result of the income / sex interaction lead to better model fit. As we can see below, the evidence that they do is weak (p ≈ 0.095).
m <- svyglm(intuse ~ I(age - 18) * sex, design = shs, family = quasibinomial())
m2 <- svyglm(intuse ~ (pmin(age, 35) + pmax(age, 35)) * sex,
design = shs,
family = quasibinomial()
)
summary(m)
##
## Call:
## svyglm(formula = intuse ~ I(age - 18) * sex, design = shs, family = quasibinomial())
##
## Survey design:
## svydesign(id = ~psu, strata = ~stratum, weight = ~grosswt, data = ex2)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.804113 0.047571 16.903 < 2e-16 ***
## I(age - 18) -0.044970 0.001382 -32.551 < 2e-16 ***
## sexfemale -0.116442 0.061748 -1.886 0.0594 .
## I(age - 18):sexfemale -0.010145 0.001864 -5.444 5.33e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 0.950831)
##
## Number of Fisher Scoring iterations: 4
summary(m2)
##
## Call:
## svyglm(formula = intuse ~ (pmin(age, 35) + pmax(age, 35)) * sex,
## design = shs, family = quasibinomial())
##
## Survey design:
## svydesign(id = ~psu, strata = ~stratum, weight = ~grosswt, data = ex2)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.152291 0.156772 13.729 < 2e-16 ***
## pmin(age, 35) 0.014055 0.005456 2.576 0.010003 *
## pmax(age, 35) -0.063366 0.001925 -32.922 < 2e-16 ***
## sexfemale 0.606718 0.211516 2.868 0.004133 **
## pmin(age, 35):sexfemale -0.017155 0.007294 -2.352 0.018691 *
## pmax(age, 35):sexfemale -0.009804 0.002587 -3.790 0.000151 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 0.9524217)
##
## Number of Fisher Scoring iterations: 5
svycontrast(m2, quote(`pmin(age, 35)` + `pmin(age, 35):sexfemale`))
## nlcon SE
## contrast -0.0031 0.0049
svycontrast(m2, quote(`pmax(age, 35)` + `pmax(age, 35):sexfemale`))
## nlcon SE
## contrast -0.07317 0.0018
shs <- update(shs, income = relevel(groupinc, ref = "under 10K"))
m3 <- svyglm(intuse ~ (pmin(age, 35) + pmax(age, 35)) * sex + income,
design = shs, family = quasibinomial()
)
summary(m3)
##
## Call:
## svyglm(formula = intuse ~ (pmin(age, 35) + pmax(age, 35)) * sex +
## income, design = shs, family = quasibinomial())
##
## Survey design:
## update(shs, income = relevel(groupinc, ref = "under 10K"))
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.275691 0.179902 7.091 1.41e-12 ***
## pmin(age, 35) -0.009041 0.006170 -1.465 0.14286
## pmax(age, 35) -0.049408 0.002124 -23.259 < 2e-16 ***
## sexfemale 0.758883 0.235975 3.216 0.00130 **
## incomemissing 0.610892 0.117721 5.189 2.15e-07 ***
## income10-20K 0.533093 0.048473 10.998 < 2e-16 ***
## income20-30k 1.246396 0.052711 23.646 < 2e-16 ***
## income30-50k 2.197628 0.063644 34.530 < 2e-16 ***
## income50K+ 2.797022 0.132077 21.177 < 2e-16 ***
## pmin(age, 35):sexfemale -0.023225 0.008137 -2.854 0.00432 **
## pmax(age, 35):sexfemale -0.008103 0.002858 -2.835 0.00459 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 0.9574657)
##
## Number of Fisher Scoring iterations: 5
m4 <- svyglm(intuse ~ (pmin(age, 35) + pmax(age, 35)) * sex + income * sex,
design = shs, family = quasibinomial()
)
regTermTest(m4, ~ income:sex)
## Wald test for income:sex
## in svyglm(formula = intuse ~ (pmin(age, 35) + pmax(age, 35)) * sex +
## income * sex, design = shs, family = quasibinomial())
## F = 1.872811 on 5 and 11641 df: p= 0.095485
Double check Lumley’s assertions made on linear vs. logistic regression here.
Logistic regression gets its name from the "link" function that determines the scale on which the outcome mean is modeled and estimated. For logistic regression the link is the "logit" function, which models the log odds. Other link functions are also possible. In particular, using the log() link function models the relative risk, i.e. \(\log(P(Y=1)) = X\beta\).
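A small simulated illustration of the difference (invented numbers): with a log link, exp(coef) estimates the risk ratio directly, and a Poisson working model with the same link targets the same quantity while being less fussy about starting values.

```r
set.seed(6)
n <- 5000
x <- rbinom(n, 1, 0.5)
y <- rbinom(n, 1, ifelse(x == 1, 0.30, 0.15))  # true relative risk of 2

# Log-binomial: needs starting values to keep fitted probabilities below 1
rr_binom <- exp(coef(glm(y ~ x, family = binomial(log),
                         start = c(-2, 0)))[["x"]])

# Poisson working model with the same log link targets the same risk ratio
rr_pois <- exp(coef(glm(y ~ x, family = poisson()))[["x"]])

c(rr_binom = rr_binom, rr_pois = rr_pois)  # both near 2
```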
Lumley goes into the details of how to fit these models via svyglm and the potential difficulties of estimating the relative risk with the two distributional families. Of note, because the binomial family with a log link is more restrictive, estimation is more sensitive to the fitting algorithm's starting values.
rr3 <- svyglm(intuse ~ (pmin(age, 35) + pmax(age, 35)) * sex + income,
design = shs, family = quasibinomial(log),
start = c(-0.5, rep(0, 10)))
rr4 <- svyglm(intuse ~ (pmin(age, 35) + pmax(age, 35)) * sex + income,
design = shs, family = quasipoisson(log))
summary(rr3)
##
## Call:
## svyglm(formula = intuse ~ (pmin(age, 35) + pmax(age, 35)) * sex +
## income, design = shs, family = quasibinomial(log), start = c(-0.5,
## rep(0, 10)))
##
## Survey design:
## update(shs, income = relevel(groupinc, ref = "under 10K"))
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.455448 0.092857 -4.905 9.48e-07 ***
## pmin(age, 35) 0.002853 0.003017 0.946 0.344
## pmax(age, 35) -0.026658 0.001260 -21.149 < 2e-16 ***
## sexfemale 0.623713 0.112658 5.536 3.16e-08 ***
## incomemissing 0.489194 0.079296 6.169 7.09e-10 ***
## income10-20K 0.416094 0.036710 11.335 < 2e-16 ***
## income20-30k 0.820195 0.036838 22.265 < 2e-16 ***
## income30-50k 1.124631 0.037188 30.241 < 2e-16 ***
## income50K+ 1.247014 0.044928 27.756 < 2e-16 ***
## pmin(age, 35):sexfemale -0.007584 0.003801 -1.995 0.046 *
## pmax(age, 35):sexfemale -0.012816 0.001721 -7.446 1.03e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 0.9256337)
##
## Number of Fisher Scoring iterations: 16
summary(rr4)
##
## Call:
## svyglm(formula = intuse ~ (pmin(age, 35) + pmax(age, 35)) * sex +
## income, design = shs, family = quasipoisson(log))
##
## Survey design:
## update(shs, income = relevel(groupinc, ref = "under 10K"))
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1679519 0.0846956 -1.983 0.04739 *
## pmin(age, 35) -0.0001441 0.0024776 -0.058 0.95363
## pmax(age, 35) -0.0312057 0.0012654 -24.660 < 2e-16 ***
## sexfemale 0.5946038 0.1036091 5.739 9.77e-09 ***
## incomemissing 0.4673470 0.0791934 5.901 3.71e-09 ***
## income10-20K 0.4152093 0.0367406 11.301 < 2e-16 ***
## income20-30k 0.8348862 0.0369330 22.605 < 2e-16 ***
## income30-50k 1.2039269 0.0368608 32.661 < 2e-16 ***
## income50K+ 1.3601811 0.0429853 31.643 < 2e-16 ***
## pmin(age, 35):sexfemale -0.0087552 0.0033861 -2.586 0.00973 **
## pmax(age, 35):sexfemale -0.0116652 0.0017556 -6.645 3.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 0.5969912)
##
## Number of Fisher Scoring iterations: 5
The logit function introduced in the first section can be extended from binomial outcome data to categorical (typically ordinal, or ordered categorical) data. Unfortunately, the data that Lumley uses for this section is not available, so I’ll skip any notes on it beyond observing that the model interpretation for a survey-weighted ordinal logit model is the same as for a typical iid ordinal logit model.
## Data isn't available on Lumley's website - or at least the data that is
## there associated with "nhanes3" does not contain the variables below.
dhanes <- svydesign(
  id = ~SDPPSU6, strata = ~SDPSTRA6,
  weights = ~WTPFHX6,
  ## nest = TRUE indicates the PSU identifier is nested
  ## within stratum - repeated across strata
  nest = TRUE,
  data = subset(nhanes3, !is.na(WTPFHX6))
)
dhanes <- update(dhanes, fpg = ifelse(phpfast >= 8 & gip < 8E3, gip, NA))
dhanes <- update(dhanes, diab = cut(fpg, c(0, 110, 125, Inf)),
                 diabi = cut(fpg, c(0, 100, 125, Inf)))
dhanes <- update(dhanes, cadmium = ifelse(udp < 88880, udp, NA),
                 creatinine = ifelse(urp < 88880, urp, NA))
dhanes <- update(dhanes,
                 age = ifelse(hsaitmor > 1000, NA, hsaitmor / 12))
model0 <- svyolr(diab ~ cadmium + creatinine, design = dhanes)
In this subsection Lumley discusses the log-log link function. The
log-log, probit, and cauchit (inverse Cauchy) link functions are all
supported by the svyolr function, and models of this form are
collectively referred to as cumulative link models.
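As a sketch of what a cumulative link model means (hypothetical cutpoints and linear predictor, not fit to any survey data), each link chooses the scale on which the cumulative probabilities \(P(Y \le k)\) are modeled; the category probabilities are differences of cumulative probabilities:

```r
## A cumulative link model specifies P(Y <= k) = linkinv(alpha_k - eta),
## where linkinv is the inverse link and eta = x'beta the linear predictor
cutpoints <- c(-1, 0.5, 2) # hypothetical increasing cutpoints alpha_k
eta <- 0.8                 # hypothetical linear predictor value

category_probs <- function(linkinv) {
  cum <- c(linkinv(cutpoints - eta), 1) # cumulative probabilities, capped at 1
  diff(c(0, cum))                       # per-category probabilities
}

category_probs(plogis) # logit link
category_probs(pnorm)  # probit link
category_probs(function(q) 1 - exp(-exp(q))) # complementary log-log link
```

Whatever the link, the four category probabilities are positive and sum to one; the links differ in how quickly the cumulative probabilities approach 0 and 1.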
Lumley introduces loglinear models in the context of the chi-square
tests (via svychisq()) introduced previously for testing
association between count variables. The two approaches test a similar
hypothesis using different methods, and I’d suggest reading the
function documentation closely to determine whether choosing between
them matters in a particular circumstance.
## Note: this masks base R's droplevels(); both drop unused factor levels
droplevels <- function(f) as.factor(as.character(f))
chis <- update(chis, smoking = droplevels(smoking), ins = droplevels(ins))
null <- svyloglin(~smoking + ins, chis)
# Dot below includes all previous variables
saturated <- update(null, ~.+smoking:ins)
anova(null, saturated)
## Analysis of Deviance Table
## Model 1: y ~ smoking + ins
## Model 2: y ~ smoking + ins + smoking:ins
## Deviance= 659.1779 p= 1.850244e-82
## Score= 694.3208 p= 9.032412e-168
The anova output above shows two p-values according to two test statistics, one the deviance and the other the score. Again, the difference here is quite “in the weeds” of how test statistic distributions are computed so I’d suggest consulting (Dobson and Barnett 2018) and/or the function documentation.
svychisq(~smoking + ins, chis)
##
## Pearson's X^2: Rao & Scott adjustment
##
## data: svychisq(~smoking + ins, chis)
## F = 130.11, ndf = 1.9923, ddf = 157.3884, p-value < 2.2e-16
pf(130.1143, 1.992, 157.338, lower.tail = FALSE)
## [1] 5.37419e-34
svychisq(~smoking + ins, chis, statistic = "Chisq")
##
## Pearson's X^2: Rao & Scott adjustment
##
## data: svychisq(~smoking + ins, chis, statistic = "Chisq")
## X-squared = 694.32, df = 2, p-value < 2.2e-16
summary(null)
## Loglinear model: svyloglin(~smoking + ins, chis)
## coef se p
## smoking1 -0.6209497 0.01251103 0.000000e+00
## smoking2 -0.1391308 0.01066059 6.275861e-39
## ins1 0.8246480 0.01015826 0.000000e+00
summary(saturated)
## Loglinear model: update(null, ~. + smoking:ins)
## coef se p
## smoking1 -0.4440871 0.01612925 7.065991e-167
## smoking2 -0.3086129 0.01735214 9.187439e-71
## ins1 0.8052182 0.01151648 0.000000e+00
## smoking1:ins1 -0.2821710 0.01847008 1.085166e-52
## smoking2:ins1 0.2510967 0.01875679 7.205759e-41
model.matrix(saturated)
## (Intercept) smoking1 smoking2 ins1 smoking1:ins1 smoking2:ins1
## 1 1 1 0 1 1 0
## 2 1 0 1 1 0 1
## 3 1 -1 -1 1 -1 -1
## 4 1 1 0 -1 -1 0
## 5 1 0 1 -1 0 -1
## 6 1 -1 -1 -1 1 1
## attr(,"assign")
## [1] 0 1 1 2 3 3
## attr(,"contrasts")
## attr(,"contrasts")$smoking
## [1] "contr.sum"
##
## attr(,"contrasts")$ins
## [1] "contr.sum"
Lumley explores model selection beyond just statistical testing by discussing graphical and hierarchical loglinear models. My sense is that the discussion here is not complete and I’d encourage any reader here to consult the references he points to for further reading on the topic.
Example: neck and back pain in NHIS
I couldn’t find the data associated with Lumley’s example on his website or from a cursory Google search for NHIS data. Furthermore, I found some of Lumley’s wording here difficult to parse. For example, for those reading along, the two graphical models listed in Figure 6.11 look identical to me, even though the text states that they correspond to two different models.
Loglinear models can be used to test for linear association between ordered categorical variables by coding the variables with ordered numbers and then running the loglinear tests discussed previously. Lumley demonstrates with the Scottish Household Survey data.
shs <- update(shs, income = ifelse(groupinc == "missing", NA, groupinc))
null <- svyloglin(~rc5 + income, shs)
saturated <- update(null, ~.^2)
anova(null, saturated)
## Analysis of Deviance Table
## Model 1: y ~ rc5 + income
## Model 2: y ~ rc5 + income + rc5:income
## Deviance= 289.2165 p= 6.429997e-26
## Score= 282.5453 p= 0
lin <- update(null, ~. + as.numeric(rc5):as.numeric(income))
anova(null, lin)
## Analysis of Deviance Table
## Model 1: y ~ rc5 + income
## Model 2: y ~ rc5 + income + as.numeric(rc5):as.numeric(income)
## Deviance= 105.3609 p= 1.305691e-21
## Score= 105.6157 p= 1.152954e-21
anova(lin, saturated)
## Analysis of Deviance Table
## Model 1: y ~ rc5 + income + as.numeric(rc5):as.numeric(income)
## Model 2: y ~ rc5 + income + rc5:income
## Deviance= 183.8556 p= 4.346664e-14
## Score= 184.0606 p= 5.465958e-182
des <- svydesign(
  id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~fouryearwt,
  nest = TRUE, data = subset(nhanes, !is.na(WTDRD1))
)
des <- update(des, sodium = DR1TSODI / 1000, potassium = DR1TPOTA / 1000,
hypertension = (BPXSAR > 140 | BPXDAR > 90) * 1)
summary(svyglm(hypertension ~ sodium*potassium, design = des))
##
## Call:
## svyglm(formula = hypertension ~ sodium * potassium, design = des)
##
## Survey design:
## update(des, sodium = DR1TSODI/1000, potassium = DR1TPOTA/1000,
## hypertension = (BPXSAR > 140 | BPXDAR > 90) * 1)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.186932 0.014691 12.724 6.42e-13 ***
## sodium -0.021010 0.004979 -4.220 0.000247 ***
## potassium 0.002540 0.006563 0.387 0.701812
## sodium:potassium 0.001984 0.001188 1.670 0.106460
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1360893)
##
## Number of Fisher Scoring iterations: 2
We see sodium intake is negatively associated with hypertension, while potassium shows a small, non-significant positive association. Because we’re not adjusting for age as we did previously, the sodium coefficient is confounded by age and shifted relative to its age-adjusted counterpart.
wa_crime_df
## # A tibble: 234 × 8
## County Agency Population murder_and_crime violent_crime property_crime
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 adams Adams SO 8910 120 29 332
## 2 adams Othello PD 6050 129 30 538
## 3 adams Ritzville PD 1740 16 2 77
## 4 asotin Asotin SO 13420 50 25 247
## 5 asotin Clarkston PD 7280 80 23 500
## 6 benton Benton SO 38645 231 80 799
## 7 benton Kennewick PD 58970 625 218 3229
## 8 benton Prosser PD 4985 35 11 215
## 9 benton Richland PD 42660 291 70 1403
## 10 benton West Richlan… 9840 49 10 161
## # ℹ 224 more rows
## # ℹ 2 more variables: num_counties <int>, num_agencies <int>
Fit a Poisson regression (family = quasipoisson) to
model the relationship between number of murders and population. Poisson
regression fits a linear model to the logarithm of the mean of the
outcome variable. If the murder rate were constant, the optimal
transformation of the predictor variable would be the logarithm of
population, and its coefficient would be 1.0. Is this supported by the
data?
Predict the total number of murders in the state using the Poisson regression model. Compare to a ratio estimator using population as the auxiliary variable.
Take simple random samples of five police districts from King County and five counties from the rest of the state. Fit a Poisson regression model with population and stratum (King County vs elsewhere) as predictors. Predict the total number of murders in the state.
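While I don’t work the exercise on the crime data here, the first question’s claim - that a constant murder rate implies a log(population) coefficient of 1.0 - can be checked on simulated data (hypothetical numbers rather than the wa_crime_df data):

```r
set.seed(3)
## Hypothetical districts with a constant murder rate per person
population <- round(runif(200, 1e3, 1e5))
murders <- rpois(200, lambda = 5e-4 * population)

## With a constant rate, log E[murders] = log(rate) + 1.0 * log(population)
fit <- glm(murders ~ log(population), family = quasipoisson)
coef(fit)["log(population)"] # close to 1.0
```

A coefficient well away from 1.0 in the real data would suggest the murder rate varies systematically with population size.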
It isn’t clear to what dataset Lumley is referring here (the question mentions a variable MISEFFRT), so I’ll leave this question unanswered.
Using the California Health Interview Study 2005 data, fit a
relative risk regression model to the relationship between the
probability of having health insurance (ins) and household
annual income (ak22_p). Important potential confounders
include age (srage_p), sex (srsex) and race
(racecen), and interactions may also be important. Compare the
coefficients to those from a logistic regression.
Using the same data as in Section 2.5.4, create an ordinal blood pressure variable based on systolic and diastolic pressure with categories “normal” (systolic < 120 and diastolic < 80), “prehypertension” (systolic < 140, diastolic < 90), “hypertension stage 1” (systolic < 160, diastolic < 100), and “hypertension stage 2” ( systolic at least 160 or diastolic at least 100). Fit proportional odds regression models to investigate the association between dietary sodium and potassium and hypertension.
Using data for Florida (X_STATE = 12) from the 2007
BRFSS, fit a loglinear model to associations between the following risk
factors and behaviors: perform vigorous physical activity
(VIGPACT), eat five or more servings of fruit or vegetables
per day (X_FV5RV), binge drinking of alcohol
(X_RFBING4), ever had an HIV test (HIVST5),
ever had Hepatitis B vaccine (HEPBVAC), age group
(X_AGE_G) and sex (X_SEXG). All except age
group are binary. You will need to remove missing values, coded 7, 8 or
9.
Lumley motivates the three titular topics by expanding on a principle developed in the second chapter: stratification. Just as using the extra information available in strata improves straightforward estimation of totals and means, the focus of this chapter is how to use “auxiliary” information to adjust for non-response bias and improve the precision of estimates.
Post-stratification is exactly what it sounds like - re-weighting estimates according to strata totals after, or apart from, any stratification that was part of the initial sampling design.
Consider a relatively straightforward design in which a population of size \(N\) can be partitioned into \(K\) mutually exclusive strata, and \(n_k\) of the \(N_k\) individuals in stratum \(k\) are sampled. In this setting the sampling weight for each individual in stratum \(k\) is \(\frac{N_k}{n_k}\), and \(N_k\) is known without any uncertainty.
If the sampling were not stratified but the \(N_k\) were still known, the estimated group sizes from simple Horvitz-Thompson estimation would not be exactly correct, but they could be corrected by re-weighting so that they match the known totals, as they would under stratified sampling.
Specifically, take each weight \(w_i = \frac{1}{\pi_i}\) and construct new weights \(w_i^* = \frac{g_i}{\pi_i} = \frac{N_k}{\hat{N}_k} \times \frac{1}{\pi_i}\).
For estimating the size of the \(k\)th group we’ll then have
\[ n_k \times \frac{g_i}{\pi_i} = n_k \times \frac{1}{\pi_i} \times \frac{N_k}{\hat{N}_k} = n_k \times \frac{\hat{N}_k}{n_k} \times \frac{N_k}{\hat{N}_k} = N_k, \] where \(\pi_i = \frac{n_k}{\hat{N}_k}\). The consequence of this re-weighting is that the estimated subgroup population sizes are exactly correct, and subsequent estimates within or across these groups benefit from the extra information.
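This algebra can be checked with a short base R simulation (a hypothetical population, not one of Lumley’s datasets): the Horvitz-Thompson group-size estimates wobble around the truth, but the post-stratified weights recover the \(N_k\) exactly.

```r
set.seed(1)
## Hypothetical population: three groups with known sizes N_k
N_k <- c(a = 500, b = 300, c = 200)
pop <- data.frame(group = rep(names(N_k), N_k))

## Simple random sample, so pi_i = n / N for everyone
n <- 100
samp <- pop[sample(nrow(pop), n), , drop = FALSE]
samp$w <- nrow(pop) / n # w_i = 1 / pi_i

## Horvitz-Thompson estimates of the group sizes: close, but not exact
Nhat_k <- tapply(samp$w, samp$group, sum)

## Post-stratified weights w* = w * N_k / Nhat_k
samp$w_ps <- samp$w * (N_k / Nhat_k)[samp$group]

## The re-weighted group sizes now equal N_k exactly
tapply(samp$w_ps, samp$group, sum)
```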
Of course, as Lumley notes, there’s a problem if no entities were sampled in a particular stratum of interest - you can’t re-weight the number 0. Still, since this is unlikely to happen for groups and samples of “reasonable” size, post-stratification remains a worthwhile strategy given the potential variance reductions it offers.
Lumley’s illustration of post-stratification uses the two-stage sample drawn from the API population, with 40 school districts sampled from California and then up to 5 schools sampled from each district, showing how post-stratification can improve precision – or not.
We’ll start with a reminder of the sample design used here: a two-stage sample.
clus2_design
## 2 - level Cluster Sampling design
## With (40, 126) clusters.
## svydesign(id = ~dnum + snum, fpc = ~fpc1 + fpc2, data = apiclus2)
Then information about the population group sizes is included in the
call to postStratify() as well as the variable/strata
across which to post-stratify.
pop.types <- data.frame(stype = c("E", "H", "M"), Freq = c(4421, 755, 1018))
ps_design <- postStratify(clus2_design, strata = ~stype, population = pop.types)
ps_design
## 2 - level Cluster Sampling design
## With (40, 126) clusters.
## postStratify(clus2_design, strata = ~stype, population = pop.types)
Totals and so on are then estimated in the usual fashion. In this example there’s a large difference in the variability of the estimated total when comparing the naive and post-stratified estimates, because much of the variability in the number of students enrolled can be explained by school type - elementary schools are typically smaller than middle schools and high schools. By including more information about the types of schools in the overall population, the standard error is decreased by a factor of roughly 2.7.
svytotal(~enroll, clus2_design, na.rm = TRUE)
## total SE
## enroll 2639273 799638
svytotal(~enroll, ps_design, na.rm = TRUE)
## total SE
## enroll 3074076 292584
In contrast, school type is not associated with the variability in
the school scores measured by the Academic Performance Index - denoted
below by api00.
svymean(~api00, clus2_design)
## mean SE
## api00 670.81 30.099
svymean(~api00, ps_design)
## mean SE
## api00 673 28.832
Indeed, the score is specifically set up to be standardized across school types, and as such there’s little variance reduction from using the post-stratification information in this instance.
Lumley notes that if the api dataset were a real survey, non response might vary as a function of school type and in which case post-stratification could help reduce non-response bias.
Post-stratifying on more than one variable requires the complete joint distribution of those variables. This can be problematic because population totals for the joint distribution - or cross-classification, as Lumley calls it - are often unavailable, even when each variable’s marginal totals are known. Raking is a method that aims to overcome this problem.
The process involves post-stratifying on each set of variables in turn, and repeating this process until the weights stop changing.
Lumley further highlights the connection between log-linear regression models and raking:
Raking could be considered a form of post-stratification where a log-linear model is used to smooth out the sample and population tables before the weights are adjusted.
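Before turning to the survey package’s rake(), the iteration Lumley describes can be sketched in base R as iterative proportional fitting on a hypothetical 2x2 table of weighted sample counts: scale the table to match the row margins, then the column margins, and repeat until the weights stop changing.

```r
## Hypothetical 2x2 table of weighted sample counts and known margins
samp_tab <- matrix(c(80, 40, 60, 120), nrow = 2)
row_pop <- c(300, 400) # known population totals for the row variable
col_pop <- c(350, 350) # known population totals for the column variable

tab <- samp_tab
repeat {
  old <- tab
  tab <- tab * (row_pop / rowSums(tab))             # post-stratify on rows
  tab <- sweep(tab, 2, col_pop / colSums(tab), "*") # ...then on columns
  if (max(abs(tab - old)) < 1e-10) break            # weights stopped changing
}
rowSums(tab) # matches row_pop
colSums(tab) # matches col_pop
```

After convergence both sets of margins match the population totals simultaneously, even though we never used (or needed) the joint population table.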
load("Data/Family Resource Survey/frs.rda")
frs.des <- svydesign(ids = ~PSU, data = frs)
pop.ctband <- data.frame(
CTBAND = 1:9,
Freq = c(
515672, 547548, 351599, 291425,
266257, 147851, 87767, 9190, 19670
)
)
pop.tenure <- data.frame(
TENURE = 1:4,
Freq = c(1459205, 493237, 128189, 156348)
)
frs.raked <- rake(frs.des,
sample = list(~CTBAND, ~TENURE),
population = list(pop.ctband, pop.tenure)
)
overall <- svymean(~HHINC, frs.raked)
with_children <- svymean(~HHINC, subset(frs.raked, DEPCHLDH > 0))
children_singleparent <- svymean(~HHINC, subset(frs.raked, DEPCHLDH > 0 & ADULTH == 1))
c(
"Overall" = overall,
"With Children" = with_children,
"Single Parent" = children_singleparent
)
## Overall.HHINC With Children.HHINC Single Parent.HHINC
## 475.2484 605.1928 282.7683
Lumley identifies two ways to understand post-stratification:
Post-stratification makes small changes to the sampling weights such that the estimated totals match the population totals.
Post-stratification is a regression estimator, where each post-stratum is an indicator in a regression model.
An extension of the first view leads to calibration estimators, while the second leads to generalized regression estimators. The former requires correctly specifying how to change the weights; the latter requires correctly specifying the model. Both can lead to substantial increases in the precision of the target estimates.
Lumley’s motivation of calibration starts with the regression estimate of a population total: if we have the auxiliary variable \(X_i\) available on all units in the population and an estimated \(\hat{\beta}\) from a sample, we can estimate \(\hat{T}\) as the sum of the predicted values from the regression.
\[ \hat{T}_{reg} = \sum_{i=1}^N X_i\hat{\beta} \]
From this starting point, Lumley describes how calibration follows a similar form, with the unknown parameter \(\hat{\beta}\) now a function of the calibration weights \(g_i\) and original sampling weights \(\pi_i\), defined in such a way that the estimated total of \(X\) equals its known population total. Lumley identifies this as the calibration constraints:
\[ T_x = \sum_{i=1}^{n} \frac{g_i}{\pi_i} X_i. \]
However, this constraint alone does not uniquely identify the \(g_i\). Consequently, the specification of the calibration weights is completed by requiring that they be “close” to the sampling weights - minimizing a distance function while still satisfying the constraint.
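The survey package’s calibrate() handles this in general; as a minimal base R sketch of the linear (least-squares distance) case with hypothetical data, the constrained minimization has a closed form in which the g-weights are a linear function of the auxiliary variables:

```r
set.seed(2)
## Hypothetical sample: auxiliary variables x (with an intercept column)
## whose population totals Tx are known
n <- 50
x <- cbind(1, runif(n, 10, 20))
w <- rep(10, n)    # base sampling weights 1 / pi_i
Tx <- c(500, 7600) # known totals: population size and total of x

## Minimise sum(w * (g - 1)^2) subject to colSums(w * g * x) == Tx.
## The Lagrange-multiplier solution makes g linear in x:
That <- colSums(w * x)   # Horvitz-Thompson totals from the sample
A <- t(x) %*% (w * x)
g <- 1 + drop(x %*% solve(A, Tx - That))

colSums(w * g * x) # now equals Tx exactly
range(g)           # calibration weights stay close to 1
```

The calibrated weights \(w_i g_i\) reproduce the known totals exactly while perturbing the original weights as little as possible in the least-squares sense.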
Calibration provides a unified view of post-stratification and raking. As Lumley states:
In linear regression calibration, the calibration weights \(g_i\) are a linear function of the auxiliary variables; in raking calibration the calibration weights are a multiplicative function of the auxiliary variables.
Variance estimation proceeds in a similar fashion for calibration as it did for post-stratification, by constructing an unbiased estimator for the residual between \(Y_i\) and the estimated mean, or population total.
Linear regression calibration
We’re in a similar spot as before with regard to the election data example.
pop.size <- sum(pop.ctband$Freq)
pop.totals <- c(
"(intercept)" = pop.size, pop.ctband$Freq[-1],
pop.tenure$Freq[-1]
)
frs.cal <- calibrate(frs.des,
formula = ~ factor(CTBAND) + factor(TENURE),
population = pop.totals,
calfun = "raking"
)
## Sample: [1] "(Intercept)" "factor(CTBAND)2" "factor(CTBAND)3" "factor(CTBAND)4"
## [5] "factor(CTBAND)5" "factor(CTBAND)6" "factor(CTBAND)7" "factor(CTBAND)8"
## [9] "factor(CTBAND)9" "factor(TENURE)2" "factor(TENURE)3" "factor(TENURE)4"
## Popltn: [1] "(intercept)" "" "" "" ""
## [6] "" "" "" "" ""
## [11] "" ""
svymean(~HHINC, frs.cal)
## mean SE
## HHINC 475.25 5.7306
svymean(~HHINC, subset(frs.cal, DEPCHLDH > 0))
## mean SE
## HHINC 605.19 11.221
svymean(~HHINC, subset(frs.cal, DEPCHLDH > 0 & ADULTH == 1))
## mean SE
## HHINC 282.77 9.3403
clus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)
logit_cal <- calibrate(clus1, ~ stype + api99,
population = c(6194, 755, 1018, 3914069),
calfun = "logit", bounds = c(0.7, 1.7)
)
svymean(~api00, clus1)
## mean SE
## api00 644.17 23.542
svymean(~api00, logit_cal)
## mean SE
## api00 665.46 3.42
m0 <- svyglm(api00 ~ ell + mobility + emer, clus1)
summary(svyglm(api00 ~ ell + mobility + emer, clus1))
##
## Call:
## svyglm(formula = api00 ~ ell + mobility + emer, design = clus1)
##
## Survey design:
## svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 780.4595 30.0210 25.997 3.16e-11 ***
## ell -3.2979 0.4689 -7.033 2.17e-05 ***
## mobility -1.4454 0.7343 -1.968 0.07474 .
## emer -1.8142 0.4234 -4.285 0.00129 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 6628.496)
##
## Number of Fisher Scoring iterations: 2
m1 <- svyglm(api00 ~ ell + mobility + emer, logit_cal)
summary(svyglm(api00 ~ ell + mobility + emer, logit_cal))
##
## Call:
## svyglm(formula = api00 ~ ell + mobility + emer, design = logit_cal)
##
## Survey design:
## calibrate(clus1, ~stype + api99, population = c(6194, 755, 1018,
## 3914069), calfun = "logit", bounds = c(0.7, 1.7))
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 789.1015 17.7622 44.426 9.18e-14 ***
## ell -3.2425 0.4803 -6.751 3.15e-05 ***
## mobility -1.5140 0.6436 -2.352 0.038318 *
## emer -1.7793 0.3824 -4.653 0.000702 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 7034.423)
##
## Number of Fisher Scoring iterations: 2
predict(m0, newdata = data.frame(ell = 5, mobility = 10, emer = 10))
## link SE
## 1 731.37 26.402
predict(m1, newdata = data.frame(ell = 5, mobility = 10, emer = 10))
## link SE
## 1 739.96 15.02
Although altering the weights can lead to higher-precision estimates,
calibration can leave units in the same cluster - which started with
identical weights - with different weights, leading to unintuitive
results. Lumley gives the example of mothers and infants sampled
together, where the estimated number of infants no longer matches the
estimated number of mothers under the calibrated weights.
Consequently, the survey package includes an option to
force within-cluster weights to remain identical when calibrating at a
given stage, denoted below by the aggregate.stage
argument.
To illustrate, Lumley uses the clus2_design from earlier
in the chapter, where the design was post-stratified on school type. I
don’t entirely follow Lumley’s argument here: he says that there is
a loss of precision because of the added constraints, but that’s not
what we see below or in the text - the constrained calibration has a
lower standard error than the post-stratified design. It’s hard to
inspect things too carefully here because Lumley uses a cal
variable, presumably assigned a calibration object, but never shows how
it was created in the text, so we’re left wondering.
cal2 <- calibrate(clus2_design, ~stype,
pop = c(4421 + 755 + 1018, 755, 1018),
aggregate.stage = 1
)
svytotal(~enroll, cal2, na.rm = TRUE)
## total SE
## enroll 3084777 246321
svytotal(~enroll, ps_design, na.rm = TRUE)
## total SE
## enroll 3074076 292584
## `cal` is never defined in the text, so this first line can't be reproduced
range(weights(cal) / weights(clus2_design))
range(weights(cal2) / weights(clus2_design))
## [1] 0.6418424 1.6764125
Lumley walks through a classic story in statistics (Basu
2011) about how poor use of auxiliary information can lead to
unreasonable behavior of design-based inference, then goes on to show
how calibrated weights could make the estimator more efficient. I’ll
summarize briefly below.
TODO(apeterson91): Add summary of Basu’s elephant and lessons learned.
All the methods discussed thus far in this chapter can be used to reduce the bias from missing data, specifically what Lumley calls unit non-response: the phenomenon in which a sampled unit cannot be measured at all - for example, a person sampled for a telephone interview who never picks up the phone.
TODO(apeterson91): finish this
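Since this section is unfinished, here is a sketch (my own illustration, not from the text) of the simplest such adjustment: a weighting-class adjustment, which treats response as a second round of sampling and inflates respondents’ weights so that each class’s weighted size matches that of the full sample.

```r
set.seed(4)
## Hypothetical sample where response rates differ by class
class <- rep(c("urban", "rural"), each = 50)
w <- rep(10, 100) # base sampling weights
responded <- rbinom(100, 1, ifelse(class == "urban", 0.9, 0.6)) == 1

## Inflate respondents' weights within each class so the weighted class
## sizes match those of the full sample
full_size <- tapply(w, class, sum)
resp_size <- tapply(w[responded], class[responded], sum)
w_adj <- (w * (full_size / resp_size)[class])[responded]

tapply(w_adj, class[responded], sum) # matches full_size by class
```

This is just post-stratification where the “population” totals are the full sample’s weighted class sizes; it removes non-response bias to the extent that respondents and non-respondents are similar within classes.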
In comparison to multistage sampling, where individuals or clusters are sampled independently of one another across stages, multiphase sampling is a design where individuals are sampled depending on information obtained in the first sample. Consequently, instead of two independent sampling probabilities that simply multiply, \(\pi_1 \times \pi_2\), we have \(\pi_1\) and a conditional probability \(\pi_{2|1}\): the probability of an entity being sampled in phase one, and the probability of it being sampled in phase two conditional on having been sampled in phase one.
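A tiny numeric sketch of these conditional probabilities (hypothetical numbers, mirroring the case-control-style subsample constructed below): the overall inclusion probability is the product \(\pi_1 \times \pi_{2|1}\), and its inverse gives the design weight.

```r
## Hypothetical two-phase design: everyone enters phase one with
## probability 0.5; at phase two, cases are always kept while
## controls are subsampled with probability 0.2
is_case <- c(TRUE, FALSE, FALSE, TRUE)
pi1 <- rep(0.5, 4)
pi2_given_1 <- ifelse(is_case, 1, 0.2)

pi_overall <- pi1 * pi2_given_1 # overall inclusion probability
1 / pi_overall                  # design weights: 2 for cases, 10 for controls
```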
nwts <- nwtsco
set.seed(1337)
subsample <- with(nwts, c(which(relaps == 1 | instit == 1),
sample(which(relaps == 0 & instit == 0), 499)))
nwts$in.subsample <- (1:nrow(nwts)) %in% subsample
nwts_design <- twophase(id = list(~1, ~1), subset = ~in.subsample,
strata = list(NULL, ~interaction(instit, relaps)),
data = nwts)
nwts_design
## Two-phase sparse-matrix design:
## twophase2(id = id, strata = strata, probs = probs, fpc = fpc,
## subset = subset, data = data, pps = pps)
## Phase 1:
## Independent Sampling design (with replacement)
## svydesign(ids = ~1)
## Phase 2:
## Stratified Independent Sampling design
## svydesign(ids = ~1, strata = ~interaction(instit, relaps), fpc = `*phase1*`)
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.6.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## Random number generation:
## RNG: Mersenne-Twister
## Normal: Inversion
## Sample: Rounding
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Denver
## tzcode source: internal
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] addhazard_1.1.0 sf_1.0-17 quantreg_5.98 SparseM_1.84-2
## [5] lattice_0.22-6 DiagrammeR_1.0.11 RSQLite_2.3.7 patchwork_1.2.0
## [9] survey_4.4-2 survival_3.6-4 Matrix_1.7-0 srvyr_1.3.0.9000
## [13] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
## [17] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
## [21] ggplot2_3.5.1 tidyverse_2.0.0 gt_0.11.0 haven_2.5.4
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 farver_2.1.2 blob_1.2.4 fastmap_1.2.0
## [5] digest_0.6.36 timechange_0.3.0 lifecycle_1.0.4 magrittr_2.0.3
## [9] compiler_4.4.1 rlang_1.1.4 sass_0.4.9 tools_4.4.1
## [13] utf8_1.2.4 yaml_2.3.9 ahaz_1.15 knitr_1.47
## [17] labeling_0.4.3 htmlwidgets_1.6.4 bit_4.0.5 classInt_0.4-10
## [21] xml2_1.3.6 RColorBrewer_1.1-3 KernSmooth_2.23-24 withr_3.0.0
## [25] fansi_1.0.6 e1071_1.7-14 colorspace_2.1-0 scales_1.3.0
## [29] MASS_7.3-60.2 cli_3.6.3 rmarkdown_2.27 generics_0.1.3
## [33] rstudioapi_0.16.0 tzdb_0.4.0 sampling_2.10 visNetwork_2.1.2
## [37] readxl_1.4.3 DBI_1.2.3 cachem_1.1.0 proxy_0.4-27
## [41] splines_4.4.1 cellranger_1.1.0 mitools_2.4 vctrs_0.6.5
## [45] jsonlite_1.8.8 hms_1.1.3 bit64_4.0.5 hexbin_1.28.4
## [49] jquerylib_0.1.4 units_0.8-5 glue_1.7.0 stringi_1.8.4
## [53] gtable_0.3.5 munsell_0.5.1 pillar_1.9.0 htmltools_0.5.8.1
## [57] dbplyr_2.5.0 R6_2.5.1 evaluate_0.24.0 lpSolve_5.6.20
## [61] highr_0.11 memoise_2.0.1 bslib_0.7.0 class_7.3-22
## [65] MatrixModels_0.5-3 Rcpp_1.0.12 xfun_0.45 pkgconfig_2.0.3